Methods and apparatus for DNA sequencing and DNA identification

ABSTRACT

Sequencing by Hybridization (SBH) methods and apparatus employing subdivided filters for discrete multiple probe analysis of multiple samples may be used for DNA identification and for DNA sequencing. Partitioned filters are prepared. Samples are affixed to sections of partitioned filters and each sector is probed with a single probe or a multiplexed probe for hybridization scoring. Hybridization data is analyzed for probe complementarity, partial sequencing by SBH or complete sequencing by SBH.

FIELD OF THE INVENTION

[0001] This invention relates in general to methods and apparatus for nucleic acid analysis, and, in particular, to methods and apparatus for DNA sequencing.

BACKGROUND

[0002] The rate of determining the sequence of the four nucleotides in DNA samples is a major technical obstacle for further advancement of molecular biology, medicine, and biotechnology. Nucleic acid sequencing methods which involve separation of DNA molecules in a gel have been in use since 1978. The only other proven method for sequencing nucleic acids is sequencing by hybridization (SBH).

[0003] The array-based approach of SBH does not require single base resolution in separation, degradation, synthesis or imaging of a DNA molecule. In the most commonly discussed variation of this method, using mismatch discriminative hybridization of short oligonucleotides K bases in length, lists of constituent K-mer oligonucleotides may be determined for target DNA. The sequence may be assembled through uniquely overlapping scored oligonucleotides.

[0004] In SBH sequence assembly, K−1 oligonucleotides which occur repeatedly in analyzed DNA fragments due to chance or biological reasons may be subject to special consideration. If there is no additional information, relatively small fragments of DNA may be fully assembled inasmuch as every base pair (bp) is read several times. In assembly of relatively longer fragments, ambiguities may arise due to repeated occurrence of a K−1 oligonucleotide. This problem does not exist if mutated or similar sequences have to be determined. Knowledge of one sequence may be used as a template to correctly assemble a similar one.
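The assembly principle of paragraphs [0003] and [0004] may be illustrated by the following minimal Python sketch. It assumes an ideal, error-free list of K-mers and a known starting K-mer; the function names and the example sequence are hypothetical and are not part of the disclosed apparatus.

```python
def kmer_spectrum(seq, k):
    """Return the set of all K-mers in seq (an idealized hybridization result)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assemble(spectrum, start):
    """Extend `start` one base at a time through unique K-1 overlaps.

    Returns (sequence, ambiguous). `ambiguous` is True when a K-1 suffix
    has more than one unused successor, i.e. a repeated K-1 oligonucleotide
    blocks unique assembly.
    """
    k = len(start)
    unused = set(spectrum) - {start}
    seq = start
    while True:
        suffix = seq[-(k - 1):]
        successors = sorted(p for p in unused if p[:k - 1] == suffix)
        if not successors:
            return seq, False          # no further overlap: assembly ends
        if len(successors) > 1:
            return seq, True           # branching point: ambiguity
        seq += successors[0][-1]
        unused.discard(successors[0])


target = "TGCAAAACTATTCC"              # illustrative sequence only
print(assemble(kmer_spectrum(target, 6), target[:6]))
print(assemble(kmer_spectrum(target, 4), target[:4]))
```

With 6-mers every 5-base overlap in this short target is unique and the sequence is recovered in full. With 4-mers the repeated trinucleotide AAA produces a branch and assembly stops early, which is the ambiguity described above.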

[0005] There are several approaches for sequencing by hybridization. In SBH Format 1, DNA samples are arrayed and labelled probes are hybridized with the samples. Replica membranes with the same sets of sample DNAs may be used for parallel scoring of several probes and/or probes may be multiplexed. Arraying and hybridization of DNA samples on nylon membranes are well developed. Each array may be reused many times. Format 1 is especially efficient for batch processing large numbers of samples.

[0006] In SBH Format 2, probes are arrayed and a labelled DNA sample fragment is hybridized to the arrayed probes. In this case, the complete sequence of one fragment may be determined from simultaneous hybridization reactions with the arrayed probes. For sequencing other DNA fragments, the same oligonucleotide array may be reused. The arrays may be produced by spotting or in situ synthesis. Specific hybridization has been demonstrated. In a variant of Format 2, DNA anchors are arrayed and ligation is used to determine oligosequences present at the end of target DNA.

[0007] In Format 3, two sets of probes are used. One set may be in the form of arrays and another, labelled set is stored in multiwell plates. In this case, target DNA need not be labelled. Target DNA and one labelled probe are added to the arrayed set of probes. If one attached probe and one labelled probe both hybridize contiguously on the target DNA, they are covalently ligated, producing a sequence twice as long to be scored. The process allows for sequencing long DNA fragments, e.g. a complete bacterial genome, without DNA subcloning in smaller pieces.

[0008] In the present invention, SBH is applied to the efficient identification and sequencing of one or more DNA samples in a short period of time. The procedure has many applications in DNA diagnostics, forensics, and gene mapping. It also may be used to identify mutations responsible for genetic disorders and other traits, to assess biodiversity and to produce many other types of data dependent on DNA sequence.

SUMMARY OF THE INVENTION

[0009] As mentioned above, Format 1 SBH is appropriate for the simultaneous analysis of a large set of samples. Parallel scoring of thousands of samples on large arrays may be applied to one or a few samples in thousands of independent hybridization reactions using small pieces of membranes. The identification of DNA may involve 1-20 probes and the identification of mutations may in some cases involve more than 1000 probes specifically selected or designed for each sample. For identification of the nature of the mutated DNA segments, specific probes may be synthesized or selected for each mutation detected in the first round of hybridizations.

[0010] According to the present invention, DNA samples may be prepared in small arrays which may be separated by appropriate spacers, and which may be simultaneously tested with probes selected from a set of oligonucleotides kept in multiwell plates. Small arrays may consist of one or more samples. DNA samples in each small array may consist of mutants or individual samples of a sequence. Consecutive small arrays which form larger arrays may represent either replication of the same array or samples of a different DNA fragment. A universal set of probes consists of sufficient probes to analyze any DNA fragment with prespecified precision, e.g. with respect to the redundancy of reading each bp. These sets may include more probes than are necessary for one specific fragment, but fewer than are necessary for testing thousands of DNA samples of different sequence.

[0011] DNA or allele identification and a diagnostic sequencing process may include the steps of:

[0012] 1) Selection of a subset of probes from a dedicated, representative or universal set to be hybridized with each of a plurality of small arrays;

[0013] 2) Adding a first probe to each subarray on each of the arrays to be analyzed in parallel;

[0014] 3) Performing hybridization and scoring of the hybridization results;

[0015] 4) Stripping off previously used probes and repeating the process for the remaining probes that are to be scored;

[0016] 5) Processing the obtained results to obtain a final analysis or to determine additional probes to be hybridized;

[0017] 6) Performing additional hybridizations for certain subarrays; and

[0018] 7) Processing complete sets of data and obtaining a final analysis.

[0019] The present invention solves problems in fast identification and sequencing of a small number of nucleic acid samples of one type (e.g. DNA, RNA) and in parallel analysis of many sample types by using a presynthesized set of probes of manageable size and samples attached to a support in the form of subarrays. Two approaches have been combined to produce an efficient and versatile process for the determination of DNA identity, for DNA diagnostics, and for identification of mutations. For the identification of known sequences a small set of shorter probes may be used in place of a longer unique probe. In this case, there may be more probes to be scored, but a universal set of probes may be synthesized to cover any type of sequence. For example, the full sets of 6-mers and 7-mers contain only 4,096 and 16,384 probes, respectively.

[0020] Full sequencing of a DNA fragment may involve two levels. One level is hybridization of a sufficient set of probes that cover every base at least once. For this purpose, a specific set of probes may be synthesized for a standard sample. This hybridization data reveals whether and where mutations (differences) occur in non-standard samples. To determine the identity of the changes, additional specific probes may be hybridized to the sample. In another embodiment, all probes from a universal set may be scored.

[0021] A universal set of probes allows scoring of a relatively small number of probes per sample in a two-step process without unacceptable expenditure of time. The hybridization process involves successive probings, in a first step of computing an optimal subset of probes to be hybridized first and, then, on the basis of the obtained results, a second step of determining additional probes to be scored from among those in the existing universal set.

[0022] The use of an array of sample arrays avoids consecutive scoring of many oligonucleotides on a single sample or on a small set of samples. This approach allows the scoring of more probes in parallel by manipulation of only one physical object. By combining the use of the subarray formed with the universal set of probes and the four-step hybridization process, a DNA sample 1000 bp in length may be sequenced in a relatively short period of time. If the sample is spotted at 50 subarrays in an array and the array is reprobed 10 times, 500 probes may be scored. This number of probes is more than sufficient. In screening for the occurrence of a mutation, approximately 335 probes may be used to cover each base three times. If a mutation is present, several covering probes will be affected. These negative probes may map the mutation with a two base precision. To solve a single base mutation mapped with this precision, an additional 15 probes may be employed. These probes cover any base combination for the two questionable positions (assuming that deletions and insertions are not involved). These probes may be scored in one cycle on 50 subarrays which contain the given sample. In the implementation of a multiple label color scheme (multiplexing), two to six probes labelled with different fluorescent dyes may be used as a pool, thereby reducing the number of hybridization cycles and shortening the sequencing process.

[0023] In more complicated cases, there may be two close mutations or insertions. They may be handled with more probes. For example, a three base insertion may be solved with 64 probes. The most complicated cases may be approached by several steps of hybridization, and the selecting of a new set of probes on the basis of results of previous hybridizations.
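The probe counts given in paragraphs [0022] and [0023] follow from simple arithmetic, which may be sketched in Python as below. The function names are hypothetical. Reading the figure of 15 as the 16 possible dinucleotides less the known normal combination is an interpretation, not something the text states explicitly.

```python
def probes_scored(subarrays, cycles):
    """One probe per subarray per hybridization cycle."""
    return subarrays * cycles


def mutation_resolving_probes(unresolved_positions, bases=4):
    """All base combinations at the unresolved positions, excluding the
    normal combination that is already known (assumed interpretation)."""
    return bases ** unresolved_positions - 1


def insertion_resolving_probes(inserted_bases, bases=4):
    """All possible sequences of an insertion of the given length."""
    return bases ** inserted_bases


print(probes_scored(50, 10))           # 50 subarrays reprobed 10 times
print(mutation_resolving_probes(2))    # two questionable positions
print(insertion_resolving_probes(3))   # three base insertion
```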

[0024] If subarrays consist of tens or hundreds of samples of one type, then several of them may be found to contain one or more changes (mutations, insertions, or deletions). For each segment where mutation occurs, a specific set of probes may be scored. The total number of probes to be scored for a type of sample may be several hundreds. The scoring of replica arrays in parallel allows scoring of hundreds of probes in a relatively small number of cycles. In addition, compatible probes may be pooled. Positive hybridizations may be assigned to the probes selected to check particular DNA segments because these segments usually differ in 75% of their constituent bases.

[0025] By using a larger set of longer probes, longer targets may be conveniently analyzed. These targets may represent pools of shorter fragments such as pools of exon clones.

[0026] The multiple step approach, which minimizes the number of necessary probes, may employ a specific hybridization scoring method to define the presence of heterozygotes (sequence variants) in a genomic segment to be sequenced from a diploid chromosomal set. There are two possibilities: i) the sequence from one chromosome represents a basic type and the sequence from the other represents a new variant; or, ii) both chromosomes contain new, but different variants. In the first case, the scanning step designed to map changes gives a maximal signal difference of two-fold at the heterozygotic position. In the second case, there is no masking; only a more complicated selection of the probes for the subsequent rounds of hybridizations may be required.

[0027] Scoring two-fold signal differences required in the first case may be achieved efficiently by comparing corresponding signals with controls containing only the basic sequence type and with the signals from other analyzed samples. This approach allows determination of a relative reduction in the hybridization signal for each particular probe in the given sample. This is significant because hybridization efficiency may vary more than two-fold for a particular probe hybridized with different DNA fragments having its full match target. In addition, heterozygotic sites may affect more than one probe depending on the number of oligonucleotide probes. Decrease of the signal for two to four consecutive probes produces a more significant indication of heterozygotic sites. The leads may be checked by small sets of selected probes among which one or a few probes are supposed to give a full match signal which is on average eight-fold stronger than the signals coming from mismatch-containing duplexes.
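The comparison with a homozygotic control described in paragraph [0027] may be sketched as follows. This is an illustrative Python sketch only: the ratio threshold of 0.65, the minimum run length of two probes and the signal values are assumptions chosen for the example and are not taken from the disclosure.

```python
def heterozygote_runs(sample, control, max_ratio=0.65, min_run=2):
    """Return (start, end) index pairs, end exclusive, of consecutive probes
    whose sample/control signal ratio is at or below max_ratio.

    A run of two or more reduced probes is treated as a lead for a
    heterozygotic site; isolated reduced probes are ignored.
    """
    n = min(len(sample), len(control))
    runs, start = [], None
    for i in range(n):
        low = control[i] > 0 and sample[i] / control[i] <= max_ratio
        if low and start is None:
            start = i
        elif not low and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    if start is not None and n - start >= min_run:
        runs.append((start, n))
    return runs


control = [100, 100, 100, 100, 100, 100, 100, 100]   # homozygotic control
sample = [98, 102, 51, 48, 55, 97, 50, 101]          # hypothetical test signals
print(heterozygote_runs(sample, control))
```

In this example probes 2 to 4 show roughly half the control signal and are reported as one lead, while the single reduced probe at index 6 is not reported because it is not supported by a neighbouring probe.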

[0028] Partitioned membranes allow a very flexible organization of experiments to accommodate relatively larger numbers of samples representing a given sequence type, or many different types of samples represented with smaller number of samples. A range of 4-256 samples can be handled with particular efficiency. Subarrays within this range of numbers of dots may be designed to match the configuration and size of standard multiwell plates used for storing and labelling oligonucleotides. The size of the subarrays may be adjusted for different number of samples, or a few standard subarray sizes may be used. If all samples of one type do not fit in one subarray, additional subarrays or membranes may be used and processed with the same probes. In addition, by adjusting the number of replicas for each subarray, the time for completion of identification or sequencing process may be varied.

DETAILED DESCRIPTION

EXAMPLE 1 Preparation of a Universal Set of Probes

[0029] Two types of universal sets of probes may be prepared. The first is a complete set (or at least a noncomplementary subset) of relatively short probes. For example, all 4096 (or about 2000 non-complementary) 6-mers, or all 16,384 (or about 8,000 non-complementary) 7-mers. Full noncomplementary subsets of 8-mers and longer probes are less convenient inasmuch as they include 32,000 or more probes.
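The sizes of the noncomplementary subsets mentioned in paragraph [0029] may be checked by enumeration. The Python sketch below assumes that "noncomplementary" means keeping one representative from each pair formed by a probe and its reverse complement; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(probe):
    """Return the reverse complement of a DNA probe."""
    return probe.translate(COMPLEMENT)[::-1]


def noncomplementary_subset(k):
    """Keep one representative of each {probe, reverse complement} pair.

    Self-complementary probes (possible only for even K) form a pair with
    themselves and are kept once.
    """
    chosen = set()
    for bases in product("ACGT", repeat=k):
        probe = "".join(bases)
        chosen.add(min(probe, reverse_complement(probe)))
    return chosen


for k in (6, 7, 8):
    print(k, 4 ** k, len(noncomplementary_subset(k)))
```

The enumeration gives 2,080 of 4,096 6-mers, 8,192 of 16,384 7-mers and 32,896 of 65,536 8-mers, in line with the approximate figures of 2000, 8,000 and 32,000 stated above.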

[0030] A second type of probe set is selected as a small subset of probes still sufficient for reading every bp in any sequence with at least one probe. For example, 12 of 16 dimers are sufficient. A small subset of 7-mers, 8-mers and 9-mers for sequencing double stranded DNA may be about 3000, 10,000 and 30,000 probes, respectively.

[0031] Probes may be prepared using standard chemistry with one to three non-specified (mixed A, T, C and G) or universal (e.g. M base, inosine) bases at the ends. If radiolabelling is used, probes may have an OH group at the 5′ end for kinasing by radiolabelled phosphorus groups. Alternatively, probes labelled with fluorescent dyes may be employed. Other types of probes like PNA (peptide nucleic acids) or probes containing modified bases which change duplex stability also may be used.

[0032] Probes may be stored in barcoded multiwell plates. For small numbers of probes, 96-well plates may be used; for 10,000 or more probes, storage in 384- or 864-well plates is preferred. Stacks of 5 to 50 plates are enough to store all probes. Approximately 5 pg of a probe may be sufficient for hybridization with one DNA sample. Thus, from a small synthesis of about 50 μg per probe, ten million samples may be analyzed. If each probe is used for every third sample, and if each sample is 1000 bp in length, then over 30 billion bases (10 human genomes) may be sequenced by a set of 5,000 probes.
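The capacity estimate in paragraph [0032] may be reproduced with the following Python sketch. The function names are hypothetical, and the human genome size of about 3 billion bp is an assumption implied by, but not stated in, the text.

```python
PG_PER_UG = 1_000_000          # picograms per microgram
HUMAN_GENOME_BP = 3_000_000_000  # assumed approximate genome size


def hybridizations_per_synthesis(synthesis_ug, pg_per_hybridization):
    """Number of single-sample hybridizations one probe synthesis supports."""
    return synthesis_ug * PG_PER_UG // pg_per_hybridization


def bases_sequenced(hybridizations, samples_per_probe_use, sample_length_bp):
    """Total bases read when each probe is needed for only one sample in
    every `samples_per_probe_use` samples."""
    samples = hybridizations * samples_per_probe_use
    return samples * sample_length_bp


uses = hybridizations_per_synthesis(50, 5)      # 50 ug synthesis, 5 pg per use
total_bp = bases_sequenced(uses, 3, 1000)       # every third sample, 1000 bp
print(uses, total_bp, total_bp // HUMAN_GENOME_BP)
```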

EXAMPLE 2 Preparation of DNA Samples

[0033] DNA fragments may be prepared as clones in M13, plasmid or lambda vectors and/or prepared directly from genomic DNA or cDNA by PCR or other amplification methods. Samples may be prepared or dispensed in multiwell plates. About 100-1000 ng of DNA samples may be prepared in 2-500 μl of final volume.

EXAMPLE 3 Preparation of DNA Arrays

[0034] Arrays may be prepared by spotting DNA samples on a support such as a nylon membrane. Spotting may be performed by using arrays of metal pins (the positions of which correspond to an array of wells in a microtiter plate) to repeatedly transfer about 20 nl of a DNA solution to a nylon membrane. By offset printing, a density of dots higher than the density of the wells is achieved. One to 25 dots may be accommodated in 1 mm² depending on the type of label used. By avoiding spotting in some preselected number of rows and columns, separate subsets (subarrays) may be formed. Samples in one subarray may be the same genomic segment of DNA (or the same gene) from different individuals, or may be different, overlapped genomic clones. Each of the subarrays may represent replica spotting of the same samples. In one example, one gene segment may be amplified from 64 patients. For each patient, the amplified gene segment may be in one 96-well plate (all 96 wells containing the same sample). A plate for each of the 64 patients is prepared. By using a 96-pin device all samples may be spotted on one 8×12 cm membrane. Subarrays may contain 64 samples, one from each patient. Where the 96 subarrays are identical, the dot span may be 1 mm² and there may be a 1 mm space between subarrays.
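The offset printing layout of the 64-patient example may be sketched in Python as follows. The sketch assumes that the 64 dots of each subarray form an 8×8 grid with a 1 mm dot pitch, which the text does not state; together with the 1 mm spacer this gives a 9 mm subarray pitch, matching the well spacing of a standard 96-well plate. The function name and coordinate convention are hypothetical.

```python
def dot_positions(subarray_rows=8, subarray_cols=12, dots_per_side=8,
                  dot_pitch_mm=1.0, spacer_mm=1.0):
    """Map (subarray_row, subarray_col, patient) to (x_mm, y_mm).

    Each patient plate is printed by the 96-pin device at its own offset
    inside every subarray (assumed 8x8 arrangement of the 64 patients).
    """
    pitch = dots_per_side * dot_pitch_mm + spacer_mm   # pin-to-pin distance
    positions = {}
    for sr in range(subarray_rows):
        for sc in range(subarray_cols):
            for patient in range(dots_per_side ** 2):
                dr, dc = divmod(patient, dots_per_side)  # offset for this plate
                x = sc * pitch + dc * dot_pitch_mm
                y = sr * pitch + dr * dot_pitch_mm
                positions[(sr, sc, patient)] = (x, y)
    return positions


layout = dot_positions()
print(len(layout))                       # total dots on the membrane
print(max(x for x, _ in layout.values()),
      max(y for _, y in layout.values()))  # extent in mm
```

Under these assumptions the 6,144 dots span 106 mm by 70 mm, which fits on the 8×12 cm membrane.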

[0035] Another approach is to use membranes or plates (available from NUNC, Naperville, Ill.) which may be partitioned by physical spacers, e.g. a plastic grid molded over the membrane, the grid being similar to the sort of membrane applied to the bottom of multiwell plates, or hydrophobic strips. A fixed physical spacer is not preferred for imaging by exposure to flat phosphor-storage screens or x-ray films.

EXAMPLE 4 Selection and Labelling of Probes

[0036] When an array of subarrays is produced, the sets of probes to be hybridized in each of the hybridization cycles on each of the subarrays are defined. For the samples in Example 3, a set of 384 probes may be selected from the universal set, and 96 probings may be performed in each of 4 cycles. Probes selected to be hybridized in one cycle preferably have similar G+C contents.
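One simple way to place probes of similar G+C content in the same cycle, as preferred in paragraph [0036], is to sort the selected probes by G+C count and cut the sorted list into cycles of 96. The Python sketch below illustrates this under that assumption; it is not the disclosed selection procedure, and the 384 probes used are arbitrary 6-mers chosen only for demonstration.

```python
from itertools import islice, product


def gc_content(probe):
    """Number of G and C bases in the probe."""
    return sum(base in "GC" for base in probe)


def assign_cycles(probes, per_cycle=96):
    """Sort probes by G+C count and split them into hybridization cycles."""
    ordered = sorted(probes, key=lambda p: (gc_content(p), p))
    return [ordered[i:i + per_cycle] for i in range(0, len(ordered), per_cycle)]


# Arbitrary demonstration set: the first 384 6-mers in enumeration order.
probes = ["".join(b) for b in islice(product("ACGT", repeat=6), 384)]
for number, cycle in enumerate(assign_cycles(probes), start=1):
    counts = [gc_content(p) for p in cycle]
    print(number, len(cycle), min(counts), max(counts))
```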

[0037] Selected probes for each cycle are transferred to a 96-well plate and then are labelled by kinasing or by other labelling procedures if they are not labelled (e.g. with stable fluorescent dyes) before they are stored.

[0038] On the basis of the first round of hybridizations, a new set of probes may be defined for each of the subarrays for additional cycles. Some of the arrays may not be used in some of the cycles. For example, if only 8 of 64 patient samples exhibit a mutation and 8 probes are scored first for each mutation, then all 64 probes may be scored in one cycle and 32 subarrays are not used. These subarrays may then be treated with hybridization buffer to prevent drying of the filters.

[0039] Probes may be retrieved from the storing plates by any convenient approach, such as a single channel pipetting device or a robotic station such as a Beckman Biomek 1000 (Beckman Instruments, Fullerton, Calif.) or a Mega Two robot (Megamation, Lawrenceville, N.J.). A robotic station may be integrated with data analysis programs and probe managing programs. Outputs of these programs may be inputs for one or more robotic stations.

[0040] Probes may be retrieved one by one and added to subarrays covered by hybridization buffer. It is preferred that retrieved probes be placed in a new plate and labelled or mixed with hybridization buffer. The preferred method of retrieval is by accessing stored plates one by one and pipetting (or transferring by metal pins) a sufficient amount of each selected probe from each plate to specific wells in an intermediary plate. An array of individually addressable pipettes or pins may be used to speed up the retrieval process.

EXAMPLE 5 Hybridization and Scoring Process

[0041] Labelled probes may be mixed with hybridization buffer and pipetted preferentially by multichannel pipettes to the subarrays. To prevent mixing of the probes between subarrays (if there are no hydrophobic strips or physical barriers imprinted in the membrane), a corresponding plastic, metal or ceramic grid may be firmly pressed to the membrane. Also, the volume of the buffer may be reduced to about 1 μl or less per mm². The concentration of the probes and hybridization conditions used may be as described previously except that the washing buffer may be quickly poured over the array of subarrays to allow fast dilution of probes and thus prevent significant cross-hybridization. For the same reason, a minimal concentration of the probes may be used and hybridization time extended to the maximal practical level. For DNA detection and sequencing, knowledge of a “normal” sequence allows the use of the continuous stacking interaction phenomenon to increase the signal. In addition to the labelled probe, additional unlabelled probes which hybridize back to back with a labelled one may be added in the hybridization reaction. The amount of the hybrid may be increased several times. The probes may be connected by ligation. This approach may be important for resolving DNA regions forming “compressions”.

[0042] In the case of radiolabelled probes, images of the filters may be obtained preferentially by phosphor-storage technology. Fluorescent labels may be scored by CCD cameras, confocal microscopy or otherwise. Raw signals are normalized based on the amount of target in each dot to properly scale and integrate data from different hybridization experiments. Differences in the amount of target DNA per dot may be corrected for by dividing signals of each probe by an average signal for all probes scored on one dot. Also, the normalized signals may be scaled, usually from 1-100, to compare data from different experiments. Also, in each subarray, several control DNAs may be used to determine an average background signal in those samples which do not contain a full match target. Furthermore, for samples obtained from diploid (polyploid) sources, homozygotic controls may be used to allow recognition of heterozygotes in the samples.
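The normalization described in paragraph [0042] may be sketched in Python as follows. Division by the per-dot average follows the text; the linear mapping of the normalized values onto the range 1-100 is only one possible reading of "scaled, usually from 1-100" and is an assumption, as are the function names and the example signals. Positive raw signals are assumed.

```python
def normalize_by_dot(signals):
    """signals[dot][probe]: divide each signal by the mean signal of its dot,
    correcting for differences in the amount of target DNA per dot."""
    normalized = []
    for row in signals:
        mean = sum(row) / len(row)
        normalized.append([value / mean for value in row])
    return normalized


def scale_1_to_100(normalized):
    """Linearly map all normalized values onto 1-100 (assumed scaling rule)."""
    flat = [v for row in normalized for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[1.0 for _ in row] for row in normalized]
    return [[1 + 99 * (v - lo) / (hi - lo) for v in row] for row in normalized]


# Two dots carrying the same target in a ten-fold different amount.
raw = [[10, 20, 30], [100, 200, 300]]
norm = normalize_by_dot(raw)
print(norm)
print(scale_1_to_100(norm))
```

After normalization the two dots give identical profiles, which is the purpose of the correction.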

EXAMPLE 6 Diagnostics—Scoring Known Mutations or Full Gene Resequencing

[0043] A simple case is to discover whether some known mutations occur in a DNA segment. Fewer than 12 probes may suffice for this purpose, for example, 5 probes positive for one allele, 5 positive for the other, and 2 negative for both. Because of the small number of probes to be scored per sample, large numbers of samples may be analyzed in parallel. For example, with 12 probes in 3 hybridization cycles, 96 different genomic loci or gene segments from 64 patients may be analyzed on one 6×9 in membrane containing 12×24 subarrays each with 64 dots representing the same DNA segment from 64 patients. In this example, samples may be prepared in sixty-four 96-well plates. Each plate may represent one patient, and each well may represent one of the DNA segments to be analyzed. The samples from 64 plates may be spotted in four replicas as four quarters of the same membrane.

[0044] A set of 12 probes may be selected by single channel pipetting or a single pin transferring device (or by an array of individually controlled pipettes or pins) for each of the 96 segments and rearranged in twelve 96-well plates. Probes may be labelled if they are not prelabelled before storing, and then probes from four plates may be mixed with hybridization buffer and added to the subarrays preferentially by a 96-channel pipetting device. After one hybridization cycle it is possible to strip off previously used probes by incubating the membrane at 37° to 55° C. in preferably undiluted hybridization or washing buffer.

[0045] The likelihood that probes positive for one allele are positive and probes positive for the other allele are negative may be used to determine which of the two alleles is present. In this redundant scoring scheme, some level (about 10%) of errors in hybridization of each probe may be tolerated.
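The redundant scoring scheme of paragraph [0045] may be illustrated by a simple likelihood comparison. The Python sketch below assumes independent probe errors at a uniform rate of 10% and considers only the two homozygous hypotheses; the probe names, function names and observed results are hypothetical.

```python
import math


def log_likelihood(observed, expected, error_rate=0.10):
    """Log-probability of the observed positive/negative calls given the
    calls expected for one allele, with independent per-probe errors."""
    total = 0.0
    for probe, positive in observed.items():
        agrees = positive == expected[probe]
        total += math.log(1 - error_rate if agrees else error_rate)
    return total


def call_allele(observed, expectations, error_rate=0.10):
    """Return the allele with the highest log-likelihood and all scores."""
    scores = {allele: log_likelihood(observed, expected, error_rate)
              for allele, expected in expectations.items()}
    return max(scores, key=scores.get), scores


a_probes = ["a1", "a2", "a3", "a4", "a5"]   # positive for allele A
b_probes = ["b1", "b2", "b3", "b4", "b5"]   # positive for allele B
n_probes = ["n1", "n2"]                     # negative for both
expectations = {
    "A": {**{p: True for p in a_probes}, **{p: False for p in b_probes + n_probes}},
    "B": {**{p: True for p in b_probes}, **{p: False for p in a_probes + n_probes}},
}
# Observed calls for an allele-A sample with one failed probe (a3).
observed = {**expectations["A"], "a3": False}
print(call_allele(observed, expectations))
```

With twelve probes, a single erroneous probe leaves the correct allele strongly favoured, which is why a modest error level per probe can be tolerated.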

[0046] An incomplete set of probes may be used for scoring most of the alleles, especially if the smaller redundancy is sufficient, e.g. one or two probes which prove the presence or absence in a sample of one of the two alleles. For example, with a set of four thousand 8-mers there is a 91% chance of finding at least one positive probe for one of the two alleles for a randomly selected locus. The incomplete set of probes may be optimized to reflect G+C content and other biases in the analyzed samples.

[0047] For full gene sequencing, genes may be amplified in an appropriate number of segments. For each segment, a set of probes (about one probe per 2-4 bases) may be selected and hybridized. These probes may identify whether there is a mutation anywhere in the analyzed segments. Segments (i.e., subarrays which contain these segments) where one or more mutated sites are detected may be hybridized with additional probes to find the exact sequence at the mutated sites. If a DNA sample is tested by every second 6-mer, and a mutation is localized at the position that is surrounded by positively hybridized probes TGCAAA and TATTCC and covered by three negative probes: CAAAAC, AAACTA and ACTATT, the mutated nucleotides must be A and/or C occurring in the normal sequence at that position. They may be changed by a single base mutation, or by a one or two nucleotide deletion and/or insertion between bases AA, AC or CT.

[0048] One approach is to select probes that extend the positively hybridized probe TGCAAA by one nucleotide to the right, and which extend the probe TATTCC by one nucleotide to the left. With these 8 probes (GCAAAA, GCAAAT, GCAAAC, GCAAAG and ATATTC, TTATTC, CTATTC, GTATTC) two questionable nucleotides are determined.
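The localization and probe selection of paragraphs [0047] and [0048] may be sketched in Python as follows. The normal sequence TGCAAAACTATTCC used here is inferred from the overlapping probes quoted in the text, and the function names are hypothetical.

```python
BASES = "ATCG"   # same order as the probes are listed in the text


def uncovered_positions(reference, step, k, positive):
    """Positions (0-based) of the normal sequence that are not covered by
    any positively hybridized probe when every `step`-th K-mer is scored."""
    covered = set()
    for start in range(0, len(reference) - k + 1, step):
        if reference[start:start + k] in positive:
            covered.update(range(start, start + k))
    return [i for i in range(len(reference)) if i not in covered]


def extend_right(probe):
    """Shift the probe one base to the right, trying every base at the new end."""
    return [probe[1:] + base for base in BASES]


def extend_left(probe):
    """Shift the probe one base to the left, trying every base at the new end."""
    return [base + probe[:-1] for base in BASES]


normal = "TGCAAAACTATTCC"            # inferred from the quoted probes
positive = {"TGCAAA", "TATTCC"}
suspect = uncovered_positions(normal, step=2, k=6, positive=positive)
print(suspect, [normal[i] for i in suspect])
print(extend_right("TGCAAA") + extend_left("TATTCC"))
```

The uncovered positions are the A and C named in the text, and the two extension functions reproduce the eight probes listed above.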

[0049] The most likely hypothesis about the mutation may be determined. For example, A is found to be mutated to G. There are two solutions satisfied by these results. Either replacement of A with G is the only change or there is in addition to that change an insertion of some number of bases between newly determined G and the following C. If the result with bridging probes is negative these options may then be checked first by at least one bridging probe comprising the mutated position (AAGCTA) and with an additional 8 probes: CAAAGA, CAAAGT, CAAAGC, CAAAGG and ACTATT, TCTATT, CCTATT, GCTATT. There are many other ways to select mutation-solving probes.

[0050] In the case of diploid samples, particular comparisons of scores for the test samples and homozygotic control may be performed to identify heterozygotes (see above). A few consecutive probes are expected to have signals roughly half as strong if the segment covered by these probes is mutated on one of the two chromosomes.

EXAMPLE 7 Identification of Genes (Mutations) Responsible for Genetic Disorders and Other Traits

[0051] The sequencing process disclosed herein has a very low cost per bp. Also, using larger universal sets of longer probes (8-mers or 9-mers), DNA fragments as long as 5-20 kb may be sequenced without subcloning. Furthermore, the speed of resequencing may be about 10 million bp/day/hybridization instrument. This performance allows for resequencing a large fraction of human genes or the human genome repeatedly from scientifically or medically interesting individuals. To resequence 50% of the human genes, about 100 million bp are checked. That may be done in a relatively short period of time at an affordable cost.

[0052] This enormous resequencing capability may be used in several ways to identify mutations and/or genes that encode for disorders or any other traits. Basically, mRNAs (which may be converted into cDNAs) from particular tissues or genomic DNA of patients with particular disorders may be used as starting materials. From both sources of DNA, separate genes or genomic fragments of appropriate length may be prepared either by cloning procedures or by in vitro amplification procedures (for example by PCR). If cloning is used, the minimal set of clones to be analyzed may be selected from the libraries before sequencing. That may be done efficiently by hybridization of a small number of probes, especially if a small number of clones longer than 5 kb is to be sorted. Cloning may increase the amount of hybridization data about two times, but does not require tens of thousands of PCR primers.

[0053] In one variant of the procedure, gene or genomic fragments may be prepared by restriction cutting with enzymes like Hga I which cuts DNA in the following way: GACGC(N5′)/CTGCG(N10′). Protruding ends of five bases are different for different fragments. One enzyme produces appropriate fragments for a certain number of genes. By cutting cDNA or genomic DNA with several enzymes in separate reactions, every gene of interest may be excised appropriately. In one approach, the cut DNA is fractionated by size. DNA fragments prepared in this way (and optionally treated with Exonuclease III which individually removes nucleotides from the 3′ end and increases length and specificity of the ends) may be dispensed in tubes or in multiwell plates. From a relatively small set of DNA adapters with a common portion and a variable protruding end of appropriate length, a pair of adapters may be selected for every gene fragment that needs to be amplified. These adapters are ligated and then PCR is performed by universal primers. From 1000 adapters, a million pairs may be generated, thus a million different fragments may be specifically amplified in the identical conditions with a universal pair of primers complementary to the common end of the adapters.

[0054] If a DNA difference is found to be repeated in several patients, and that sequence change is nonsense or can change function of the corresponding protein, then the mutated gene may be responsible for the disorder. By analyzing a significant number of individuals with particular traits, functional allelic variations of particular genes could be associated with specific traits.

[0055] This approach may be used to eliminate the need for very expensive genetic mapping on extensive pedigrees and has special value when there is no such genetic data or material.

EXAMPLE 8 Scoring Single Nucleotide Polymorphisms in Genetic Mapping

[0056] Techniques disclosed in this application are appropriate for an efficient identification of genomic fragments with single nucleotide polymorphisms (SNUPs). By applying the described sequencing process in 10 individuals to a large number of genomic fragments of known sequence that may be amplified by cloning or by in vitro amplification, a sufficient number of DNA segments with SNUPs may be identified. The polymorphic fragments are further used as SNUP markers. These markers are either mapped previously (for example they represent mapped STSs) or they may be mapped through the screening procedure described below.

[0057] SNUPs may be scored in every individual from relevant families or populations by amplifying markers and arraying them in the form of the array of subarrays. Subarrays contain the same marker amplified from the analyzed individuals. For each marker, as in the diagnostics of known mutations, a set of 6 or fewer probes positive for one allele and 6 or fewer probes positive for the other allele may be selected and scored. From the significant association of one or a group of the markers with the disorder, chromosomal position of the responsible gene(s) may be determined. Because of the high throughput and low cost, thousands of markers may be scored for thousands of individuals. This amount of data allows localization of a gene at a resolution level of less than one million bp as well as localization of genes involved in polygenic diseases. Localized genes may be identified by sequencing particular regions from relevant normal and affected individuals to score a mutation(s).

[0058] PCR is preferred for amplification of markers from genomic DNA. Each of the markers requires a specific pair of primers. The existing markers may be convertible or new markers may be defined which may be prepared by cutting genomic DNA by Hga I type restriction enzymes, and by ligation with a pair of adapters as described in Example 7.

[0059] SNUP markers can be amplified or spotted as pools to reduce the number of independent amplification reactions. In this case, more probes are scored per sample. When 4 markers are pooled and spotted on 12 replica membranes, then 48 probes (12 per marker) may be scored in 4 cycles.
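The pooling arithmetic above may be illustrated by the following minimal sketch. It assumes, for illustration only, that each replica membrane receives one probe per hybridization cycle; the function name `cycles_needed` and its parameters are hypothetical and are not part of the described process.

```python
import math


def cycles_needed(markers_per_pool: int, probes_per_marker: int,
                  replica_membranes: int) -> int:
    """Hybridization cycles needed to score every probe for a pooled spot.

    Assumption: one probe is scored per replica membrane per cycle.
    """
    total_probes = markers_per_pool * probes_per_marker
    return math.ceil(total_probes / replica_membranes)


# Figures from the paragraph above: 4 pooled markers, 12 probes per marker,
# 12 replica membranes.
total = 4 * 12
cycles = cycles_needed(markers_per_pool=4, probes_per_marker=12,
                       replica_membranes=12)
print(total, "probes in", cycles, "cycles")
```

Under that assumption, pooling fewer markers or preparing more replica membranes reduces the number of cycles proportionally.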

EXAMPLE 9 Detection and Verification of Identity of DNA Fragments

[0060] DNA fragments generated by restriction cutting, cloning or in vitro amplification (e.g. PCR) frequently may be identified in an experiment. Identification may be performed by verifying the presence of a DNA band of specific size on gel electrophoresis. Alternatively, a specific oligonucleotide may be prepared and used to verify a DNA sample in question by hybridization. The procedure developed here allows for more efficient identification of a large number of samples without preparing a specific oligonucleotide for each fragment. A set of positive and negative probes may be selected from the universal set for each fragment on the basis of the known sequences. Probes that are selected to be positive usually are able to form one or a few overlapping groups and negative probes are spread over the whole insert.

[0061] This technology may be used for identification of STSs in the process of their mapping on the YAC clones. Each of the STSs may be tested on about 100 YAC clones or pools of YAC clones. DNAs from these 100 reactions may be spotted in one subarray. Different STSs may represent consecutive subarrays. In several hybridization cycles, a signature may be generated for each of the DNA samples, which signature proves or disproves existence of the particular STS in the given YAC clone with necessary confidence.

[0062] To reduce the number of independent PCR reactions or the number of independent samples for spotting, several STSs may be amplified simultaneously in a reaction or PCR samples may be mixed, respectively. In this case more probes have to be scored per one dot. The pooling of STSs is independent of pooling YACs and may be used on single YACs or pools of YACs. This scheme is especially attractive when several probes labelled with different colors are hybridized together.

[0063] In addition to confirmation of the existence of a DNA fragment in a sample, the amount of DNA may be estimated using intensities of the hybridization of several separate probes or one or more pools of probes. By comparing obtained intensities with intensities for control samples having a known amount of DNA, the quantity of DNA in all spotted samples is determined simultaneously. Because only a few probes are necessary for identification of a DNA fragment, and there are N possible probes that may be used for DNA N bases long, this application does not require a large set of probes to be sufficient for identification of any DNA segment. From one thousand 8-mers, on average about 30 full matching probes may be selected for a 1000 bp fragment.
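The estimate of about 30 full matching probes may be checked with a short calculation. The sketch below assumes a random fragment sequence and assumes that a probe counts as positive when it matches either strand of the 1000 bp fragment; both assumptions, and the function name `expected_positive_probes`, are introduced here for illustration only.

```python
def expected_positive_probes(set_size: int, probe_len: int,
                             fragment_len: int,
                             both_strands: bool = True) -> float:
    """Expected number of probes in a set with a full match in a fragment.

    Assumes a random fragment in which every K-mer is equally likely.
    """
    n_possible = 4 ** probe_len                # 65,536 for 8-mers
    windows = fragment_len - probe_len + 1     # K-mer positions on one strand
    if both_strands:
        windows *= 2                           # assumption: either strand may match
    p_present = 1.0 - (1.0 - 1.0 / n_possible) ** windows
    return set_size * p_present


both = expected_positive_probes(1000, 8, 1000)
single = expected_positive_probes(1000, 8, 1000, both_strands=False)
print(round(both, 1), round(single, 1))
```

With both strands counted the estimate is close to 30, in line with the figure given above; counting a single strand roughly halves it.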

EXAMPLE 10 Identification of Infectious Disease Organisms and Their Variants

[0064] DNA-based tests for the detection of viral, bacterial, fungal and other parasitic organisms in patients are usually more reliable and less expensive than alternatives. The major advantage of DNA tests is to be able to identify specific strains and mutants, and eventually be able to apply more effective treatment. Two applications are described below.

[0065] The presence of 12 known antibiotic resistance genes in bacterial infections may be tested by amplifying these genes. The amplified products from 128 patients may be spotted in two subarrays and 24 subarrays for 12 genes may then be repeated four times on an 8×12 cm membrane. For each gene, 12 probes may be selected for positive and negative scoring. Hybridizations may be performed in 3 cycles. For these tests, as for the tests in Example 9, a much smaller set of probes is most likely to be universal. For example, from a set of one thousand 8-mers, on average 30 probes are positive in 1000 bp fragments, and 10 positive probes are usually sufficient for a highly reliable identification. As described in Example 9, several genes may be amplified and/or spotted together and the amount of the given DNA may be determined. The amount of amplified gene may be used as an indicator of the level of infection.

[0066] Another example involves possible sequencing of one gene or the whole genome of HIV. Because of rapid diversification, the virus poses many difficulties for selection of an optimal therapy. DNA fragments may be amplified from isolated viruses from up to 64 patients and resequenced by the described procedure. On the basis of the obtained sequence the optimal therapy may be selected. If there is a mixture of two virus types of which one has the basic sequence (similar to the case of heterozygotes), the mutant may be identified by quantitative comparisons of its hybridization scores with scores of other samples, especially control samples containing the basic virus type only. Scores half as large may be obtained for three to four probes that cover the site mutated in one of the two virus types present in the sample (see above).

EXAMPLE 11 Forensic and Parental Identification Applications

[0067] Sequence polymorphisms make an individual genomic DNA unique. This permits analysis of blood or other body fluids or tissues from a crime scene and comparison with samples from criminal suspects. A sufficient number of polymorphic sites are scored to produce a unique signature of a sample. SBH may easily score single nucleotide polymorphisms to produce such signatures.

[0068] A set of DNA fragments (10-1000) may be amplified from samples and suspects. DNAs from samples and suspects representing one fragment are spotted in one or several subarrays and each subarray may be replicated 4 times. In three cycles, 12 probes may determine the presence of allele A or B in each of the samples, including suspects, for each DNA locus. Matching the patterns of samples and suspects may lead to discovery of the suspect responsible for the crime.

[0069] The same procedure may be applicable to prove or disprove the identity of parents of a child. DNA may be prepared and polymorphic loci amplified from the child and adults; patterns of A or B alleles may be determined by hybridization for each. Comparisons of the obtained patterns, along with positive and negative controls, aid in the determination of familial relationships. In this case, only a significant portion of the alleles need match with one parent for identification. Large numbers of scored loci allow for the avoidance of statistical errors in the procedure or of masking effects of de novo mutations.

EXAMPLE 12 Assessing Genetic Diversity of Populations or Species and Biological Diversity of Ecological Niches

[0070] Measuring the frequency of allelic variations on a significant number of loci (for example, several genes or entire mitochondrial DNA) permits development of different types of conclusions, such as conclusions regarding the impact of the environment on the genotypes, history and evolution of a population or its susceptibility to diseases or extinction, and others. These assessments may be performed by testing specific known alleles or by full resequencing of some loci to be able to define de novo mutations which may reveal fine variations or presence of mutagens in the environment.

[0071] Additionally, biodiversity in the microbial world may be surveyed by resequencing evolutionarily conserved DNA sequences, such as the genes for ribosomal RNAs or genes for highly conservative proteins. DNA may be prepared from the environment and particular genes amplified using primers corresponding to conservative sequences. DNA fragments may be cloned preferentially in a plasmid vector (or diluted to the level of one molecule per well in multiwell plates and then amplified in vitro). Clones prepared this way may be resequenced as described above. Two types of information are obtained. First of all, a catalogue of different species may be defined as well as the density of the individuals for each species. Another segment of information may be used to measure the influence of ecological factors or pollution on the ecosystem. It may reveal whether some species are eradicated or whether the abundance ratios among species are altered due to the pollution. The method also is applicable for sequencing DNAs from fossils.

EXAMPLE 13 DNA Sequencing

[0072] An array of subarrays allows for efficient sequencing of a small set of samples arrayed in the form of replicated subarrays. For example, 64 samples may be arrayed on an 8×8 mm subarray and 16×24 subarrays may be replicated on a 15×23 cm membrane with 1 mm wide spacers between the subarrays. Several replica membranes may be made. For example, probes from a universal set of three thousand seventy-two 7-mers may be divided in thirty-two 96-well plates and labelled by kinasing. Four membranes may be processed in parallel during one hybridization cycle. On each membrane, 384 probes may be scored. All probes may be scored in two hybridization cycles. Hybridization intensities may be scored and the sequence assembled as described below.
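The layout and throughput figures in the preceding paragraph may be cross-checked as follows. This is a minimal sketch which assumes that each subarray receives exactly one probe per hybridization cycle; the helper `membrane_span_mm` is a hypothetical name used only for this illustration.

```python
import math


def membrane_span_mm(n_subarrays: int, subarray_mm: int = 8,
                     spacer_mm: int = 1) -> int:
    """Length occupied by a row of subarrays with spacers between them."""
    return n_subarrays * subarray_mm + (n_subarrays - 1) * spacer_mm


# Geometry: 16 x 24 subarrays of 8 x 8 mm with 1 mm spacers.
width_mm = membrane_span_mm(16)    # must fit within 150 mm
length_mm = membrane_span_mm(24)   # must fit within 230 mm

# Throughput: 3072 probes in 96-well plates, four membranes per cycle.
plates = 3072 // 96
probes_per_membrane = 16 * 24      # assumption: one probe per subarray
probes_per_cycle = probes_per_membrane * 4
cycles = math.ceil(3072 / probes_per_cycle)

print(width_mm, length_mm, plates, probes_per_membrane, cycles)
```

The computed spans fit within the 15×23 cm membrane, and the plate, probe and cycle counts agree with the figures given above.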

[0073] If a single sample subarray or subarrays contains several unknowns, especially when similar samples are used, a smaller number of probes may be sufficient if they are intelligently selected on the basis of results of previously scored probes. For example, if probe AAAAAAA is not positive, there is a small chance that any of 8 overlapping probes is positive. If AAAAAAA is positive, then two probes are usually positive. The sequencing process in this case consists of first hybridizing a subset of minimally overlapped probes to define positive anchors and then successively selecting probes which confirm one of the most likely hypotheses about the order of anchors and size and type of gaps between them. In this second phase, pools of 2-10 probes may be used where each probe is selected to be positive in only one DNA sample which is different from the samples expected to be positive with other probes from the pool.

[0074] The subarray approach allows efficient implementation of probe competition (overlapped probes) or probe cooperation (continuous stacking of probes) in solving branching problems. After hybridization of a universal set of probes the sequence assembly program determines candidate sequence subfragments (SFs). For the further assembly of SFs, additional information has to be provided (from overlapped sequences of DNA fragments, similar sequences, single pass gel sequences, or from other hybridization or restriction mapping data). Competitive hybridization and continuous stacking interactions have been proposed for SF assembly. These approaches are of limited practical value for sequencing of large numbers of samples by SBH wherein a labelled probe is applied to a sample affixed to an array if a uniform array is used. Fortunately, analysis of small numbers of samples using replica subarrays allows efficient implementation of both approaches. On each of the replica subarrays, one branching point may be tested for one or more DNA samples using pools of probes similarly as in solving mutated sequences in different samples spotted in the same subarray (see above).

[0075] If in each of 64 samples described in this example, there are about 100 branching points, and if 8 samples are analyzed in parallel in each subarray, then at least 800 subarray probings solve all branches. This means that for the 3072 basic probings an additional 800 probings (about 25%) are employed. More preferably, two probings are used for one branching point. If the subarrays are smaller, fewer additional probings are used. For example, if subarrays consist of 16 samples, 200 additional probings may be scored (about 6%). By using 7-mer probes (N₁₋₂B₇N₁₋₂) and competitive or collaborative branching solving approaches or both, fragments of about 1000 bp may be assembled by about 4000 probings. Furthermore, using 8-mer probes (NB₈N), 4 kb or longer fragments may be assembled with 12,000 probings. Gapped probes, for example, NB₄NB₃N or NB₄NB₄N, may be used to reduce the number of branching points.
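The counts of additional probings may be reproduced with the short sketch below. It assumes one probing per branching point and that the same 8-fold parallelism applies to the 16-sample case, which is an interpretation of the text rather than a stated rule; the name `additional_probings` is hypothetical.

```python
import math

BASIC_PROBINGS = 3072  # size of the universal 7-mer set used in this example


def additional_probings(samples: int, branch_points_per_sample: int,
                        samples_per_probing: int = 8) -> int:
    """Probings needed to resolve all branching points.

    Assumptions: one probing per branching point, and a fixed number of
    samples resolved in parallel by each subarray probing.
    """
    return math.ceil(samples * branch_points_per_sample / samples_per_probing)


extra_64 = additional_probings(64, 100)
extra_16 = additional_probings(16, 100)
share_64 = extra_64 / BASIC_PROBINGS
share_16 = extra_16 / BASIC_PROBINGS
print(extra_64, f"{share_64:.0%}", extra_16, f"{share_16:.1%}")
```

Using two probings per branching point, as preferred above, would double both counts.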

EXAMPLE 14 DNA Analysis by Transient Attachment to Subarrays of Probes and Ligation of Labelled Probes

[0076] Oligonucleotide probes having an informative length of four to 40 bases are synthesized by standard chemistry and stored in tubes or in multiwell plates. Specific sets of probes comprising one to 10,000 probes are arrayed by deposition or in situ synthesis on separate supports or distinct sections of a larger support. In the last case, sections or subarrays may be separated by physical or hydrophobic barriers. The probe arrays may be prepared by in situ synthesis. A sample DNA of appropriate size is hybridized with one or more specific arrays. Many samples may be interrogated as pools at the same subarrays or independently with different subarrays within one support. Simultaneously with the sample or subsequently, a single labelled probe or a pool of labelled probes is added on each of the subarrays. If attached and labelled probes hybridize back to back on the complementary target in the sample DNA they are ligated. Occurrence of ligation will be measured by detecting a label from the probe.

[0077] This procedure is a variant of the described DNA analysis process in which DNA samples are not permanently attached to the support. Transient attachment is provided by probes fixed to the support. In this case there is no need for a target DNA arraying process. In addition, ligation allows detection of longer oligonucleotide sequences by combining short labelled probes with short fixed probes.

[0078] The process has several unique features. Basically, the transient attachment of the target allows its reuse. After ligation occurs the target may be released and the label will stay covalently attached to the support. This feature allows cycling the target and production of detectable signal with a small quantity of the target. Under optimal conditions, targets do not need to be amplified, e.g. natural sources of the DNA samples may be directly used for diagnostics and sequencing purposes. Targets may be released by cycling the temperature between efficient hybridization and efficient melting of duplexes. More preferably, there is no cycling. The temperature and concentrations of components may be defined to have an equilibrium between free targets and targets entered in hybrids at about a 50:50 level. In this case there is a continuous production of ligated products. For different purposes different equilibrium ratios are optimal.

[0079] An electric field may be used to enhance target use. At the beginning, a horizontal field pulsing within each subarray may be employed to provide for faster target sorting. In this phase, the equilibrium is moved toward hybrid formation, and unlabelled probes may be used. After a target sorting phase, an appropriate washing (which may be helped by a vertical electric field for restricting movement of the samples) may be performed. Several cycles of discriminative hybrid melting, target harvesting by hybridization and ligation and removing of unused targets may be introduced to increase specificity. In the next step, labelled probes are added and vertical electrical pulses may be applied. By increasing temperature, an optimal free and hybridized target ratio may be achieved. The vertical electric field prevents diffusion of the sorted targets.

[0080] The subarrays of fixed probes and sets of labelled probes (specially designed or selected from a universal probe set) may be arranged in various ways to allow an efficient and flexible sequencing and diagnostics process. For example, if a short fragment (about 100-500 bp) of a bacterial genome is to be partially or completely sequenced, small arrays of probes (5-30 bases in length) designed on the basis of known sequence may be used. If interrogated with a different pool of 10 labelled probes per subarray, an array of 10 subarrays, each having 10 probes, allows checking of 200 bases, assuming that only two bases connected by ligation are scored. Under the conditions where mismatches are discriminated throughout the hybrid, probes may be displaced by more than one base to cover the longer target with the same number of probes. By using long probes, the target may be interrogated directly without amplification or isolation from the rest of DNA in the sample. Also, several targets may be analyzed (screened for) in one sample simultaneously. If the obtained results indicate occurrence of a mutation (or a pathogen), additional pools of probes may be used to detect the type of the mutation or subtype of pathogen. This is a desirable feature of the process which may be very cost effective in preventive diagnosis where only a small fraction of patients is expected to have an infection or mutation.

[0081] In the processes described in the examples, various detection methods may be used, for example, radiolabels, fluorescent labels, enzymes or antibodies (chemiluminescence), large molecules or particles detectable by light scattering or interferometric procedures.

EXAMPLE 15 Oligonucleotide Probes and Targets Suitable for SBH

[0082] In order to obtain experimental sequence data defined as a matrix of (number of fragments-clones)×(number of probes), the number of probes may be reduced depending on the number of fragments used and vice versa. The optimal ratio of the two numbers is defined by the technological requirements of a particular sequencing by hybridization process.

[0083] There are two parameters which influence the choice of probe length. The first is the success in obtaining hybridization results that show the required degree of discrimination. The second is the technological feasibility of synthesis of the required number of probes.

[0084] The requirement of obtaining sufficient hybridization discrimination with practical and useful amounts of target nucleic acid limits the probe length. It is difficult to obtain a sufficient amount of hybrid with short probes, and to discriminate end mismatches with long probes. Traditionally, the use of probes shorter than 11-mers in the literature is limited to very stable probes [Estivill et al., Nucl. Acids Res. 15: 1415 (1987)]. On the other hand, probes longer than 15 bases discriminate end mismatches with difficulty [Wood et al., Proc. Natl. Acad. Sci. USA 82: 1585 (1985)].

[0085] One solution for the problems of unstable probes and end mismatch discrimination is the use of a group of longer probes representing a single shorter probe in an informational sense. For example, groups of sixteen 10-mers may be used instead of single 8-mers. Every member of the group has a common core 8-mer and one of three possible variations on outer positions with two variations at each end. The probe may be represented as 5′ (A, T, C, G) (A, T, C, G) B₈ (A, T, C, G) 3′. With this type of probe one does not need to discriminate the non-informative end bases (two on 5′ end, and one on 3′ end) since only the internal 8-mer is read. This solution employs higher mass amounts of probe and label in hybridization reactions.

[0086] These disadvantages are eliminated by the use of a few sets of discriminative hybridization conditions for oligomer probes as short as 6-mers.

[0087] The number of hybridization reactions is dependent on the number of discrete labelled probes. Therefore, in the cases of sequencing shorter nucleic acids using a smaller number of fragments-clones than the number of oligonucleotides, it is better to use oligomers as the targets and nucleic acid fragments as probes.

[0088] Target nucleic acids which have undefined sequences may be produced as a mixture of representative libraries in a phage or plasmid vector having inserts of genomic fragments of different sizes or in samples prepared by PCR. Inevitable gaps and uncertainties in alignment of sequenced fragments arise from nonrandom or repetitive sequence organization of complex genomes and difficulties in cloning poisonous sequences in Escherichia coli. These problems are inherent in sequencing large complex molecules using any method. Such problems may be minimized by the choice of libraries and number of subclones used for hybridization. Alternatively, such difficulties may be overcome through the use of amplified target sequences, e.g. by PCR amplification, ligation reactions, ligation-amplified reactions, etc.

[0089] Nucleic acids and methods for isolating, cloning and sequencing nucleic acids are well known to those of skill in the art. See e.g., Ausubel et al., Current Protocols in Molecular Biology, Vol. 1-2, John Wiley & Sons (1989); and Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., Vols. 1-3, Cold Spring Harbor Press (1989), both of which are incorporated by reference herein.

[0090] SBH is a well developed technology that may be practiced by a number of methods known to those skilled in the art. Specifically, techniques related to sequencing by hybridization in the following documents are incorporated by reference herein: Drmanac et al., U.S. Pat. No. 5,202,231 (hereby incorporated by reference herein), issued Apr. 13, 1993; Drmanac et al., Genomics, 4, 114-128 (1989); Drmanac et al., Proceedings of the First Int'l. Conf. Electrophoresis Supercomputing Human Genome, Cantor, DR & Lim HA eds, World Scientific Pub. Co., Singapore, 47-59 (1991); Drmanac et al., Science, 260, 1649-1652 (1993); Lehrach et al., Genome Analysis: Genetic and Physical Mapping, 1, 39-81 (1990), Cold Spring Harbor Laboratory Press; Drmanac et al., Nucl. Acids Res., 4691 (1986); Stevanovic et al., Gene, 79, 139 (1989); Panusku et al., Mol. Biol. Evol., 1, 607 (1990); Nizetic et al., Nucl. Acids Res., 19, 182 (1991); Drmanac et al., J. Biomol. Struct. Dyn., 5, 1085 (1991); Hoheisel et al., Mol. Gen., 4, 125-132 (1991); Strezoska et al., Proc. Nat'l. Acad. Sci. (USA), 88, 10089 (1991); Drmanac et al., Nucl. Acids Res., 19, 5839 (1991); and Drmanac et al., Int. J. Genome Res., 1, 59-79 (1992).

EXAMPLE 16 Determining Sequence from Hybridization Data

[0091] Sequence assembly may be interrupted wherever a given overlapping (N−1)mer occurs two or more times. Then either of the two N-mers differing in the last nucleotide may be used in extending the sequence. This branching point limits unambiguous assembly of sequence.
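The branching point described above may be illustrated with a minimal sketch of overlap extension. It is not the assembly program of Example 18; it assumes an error-free set of positive N-mers, and the function names, the 4-mer probe length and the short test sequences are hypothetical choices made for illustration.

```python
def kmer_spectrum(target: str, n: int) -> set[str]:
    """Set of all N-mers present in a target (idealized, error-free scoring)."""
    return {target[i:i + n] for i in range(len(target) - n + 1)}


def extend_until_branch(spectrum: set[str], start: str) -> tuple[str, list[str]]:
    """Extend a sequence through (N-1)-base overlaps until a branch or dead end.

    Returns the assembled sequence and the candidate N-mers at the stop point:
    two or more candidates indicate a branching point, none indicates the end.
    """
    n = len(start)
    sequence = start
    visited = {start}
    while True:
        overlap = sequence[-(n - 1):]
        successors = [overlap + base for base in "ACGT"
                      if overlap + base in spectrum]
        if len(successors) != 1 or successors[0] in visited:
            return sequence, successors
        visited.add(successors[0])
        sequence += successors[0][-1]


# The 3-mer GTT occurs twice in this hypothetical target, so assembly with
# 4-mers stops after "ACGTT" with two possible continuations.
branched = extend_until_branch(kmer_spectrum("ACGTTCAGGTTGA", 4), "ACGT")

# A target without repeated 3-mers is assembled to its end.
linear = extend_until_branch(kmer_spectrum("ATGCCGTA", 4), "ATGC")
print(branched, linear)
```

In the first case the two candidate N-mers differ only in their last nucleotide, which is the ambiguity the paragraph describes.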

[0092] Reassembling the sequence of known oligonucleotides that hybridize to the target nucleic acid to generate the complete sequence of the target nucleic acid may not be accomplished in some cases. This is because some information may be lost if the target nucleic acid is not in fragments of appropriate size in relation to the size of oligonucleotide that is used for hybridizing. The quantity of information lost is proportional to the length of a target being sequenced. However, if sufficiently short targets are used, their sequence may be unambiguously determined.

[0093] The probable frequency of duplicated sequences that would interfere with sequence assembly which is distributed along a certain length of DNA may be calculated. This derivation requires the introduction of the definition of a parameter having to do with sequence organization: the sequence subfragment (SF). A sequence subfragment results if any part of the sequence of a target nucleic acid starts and ends with an (N−1)mer that is repeated two or more times within the target sequence. Thus, subfragments are sequences generated between two points of branching in the process of assembly of the sequences in the method of the invention. The sum of all subfragments is longer than the actual target nucleic acid because of overlapping short ends. Generally, subfragments may not be assembled in a linear order without additional information since they have shared (N−1)mers at their ends and starts. Different numbers of subfragments are obtained for each nucleic acid target depending on the number of its repeated (N−1)mers. The number depends on the value of N−1 and the length of the target.
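The definition of a subfragment may be made concrete with the following simplified sketch, which cuts a known target at every occurrence of a repeated (N−1)mer. It is offered as an illustration of the definition only, not as the program described in Example 18; the function name `subfragments` and the example target are hypothetical.

```python
from collections import Counter


def subfragments(target: str, n: int) -> list[str]:
    """Split a target into subfragments bounded by repeated (N-1)mers.

    Each piece starts and/or ends with an (N-1)mer that occurs two or more
    times in the target (or with an end of the target itself). Adjacent
    pieces share that (N-1)mer, so their total length exceeds the target.
    """
    m = n - 1
    mers = [target[i:i + m] for i in range(len(target) - m + 1)]
    counts = Counter(mers)
    branch_positions = [i for i, mer in enumerate(mers) if counts[mer] >= 2]

    pieces = []
    start = 0
    for pos in branch_positions:
        pieces.append(target[start:pos + m])   # include the shared (N-1)mer
        start = pos
    pieces.append(target[start:])
    # Drop degenerate pieces that consist of the shared (N-1)mer alone.
    return [piece for piece in pieces if len(piece) > m]


target = "ACGTTCAGGTTGA"          # hypothetical target; GTT occurs twice
pieces = subfragments(target, 4)  # 4-mer probes, so (N-1) = 3
total_length = sum(len(p) for p in pieces)
print(pieces, total_length, len(target))
```

Here the 13-base target yields three subfragments whose lengths sum to 19, reflecting the overlapping short ends noted above.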

[0094] Probability calculations can estimate the interrelationship of the two factors. If the ordering of positive N-mers is accomplished by using overlapping sequences of length N−1 or at an average distance of A_(o), the number of subfragments, N_(sf), of a fragment L_(f) bases long is given by equation one:

$$N_{sf} = 1 + A_o \times \sum_{K \ge 2} K \times P(K, L_f)$$

[0095] where K ≥ 2, and P(K, L_(f)) represents the probability of an N-mer occurring K times on a fragment L_(f) bases long. Also, a computer program that is able to form subfragments from the content of N-mers for any given sequence is described below in Example 18.

[0096] The number of subfragments increases with the increase of lengths of fragments for a given length of probe. Obtained subfragments may not be uniquely ordered among themselves. Although not complete, this information is very useful for comparative sequence analysis and the recognition of functional sequence characteristics. This type of information may be called partial sequence. Another way of obtaining partial sequence is the use of only a subset of oligonucleotide probes of a given length.

[0097] There may be relatively good agreement between predicted sequence according to theory and a computer simulation for a random DNA sequence. For instance, for N−1=7 [using an 8-mer or groups of sixteen 10-mers of type 5′ (A, T, C, G) B₈ (A, T, C, G) 3′], a target nucleic acid of 200 bases will have an average of three subfragments. However, because of the dispersion around the mean, a library of target nucleic acid should have inserts of 500 bp so that fewer than 1 in 2000 targets have more than three subfragments. Thus, in an ideal case of sequence determination of a long nucleic acid of random sequence, a representative library with sufficiently short inserts of target nucleic acid may be used. For such inserts, it is possible to reconstruct the individual target by the method of the invention. The entire sequence of a large nucleic acid is then obtained by overlapping of the defined individual insert sequences.

[0098] To reduce the need for very short fragments (e.g. 50 bases for 8-mer probes), the information contained in the overlapped fragments present in every random DNA fragmentation process, such as cloning or random PCR, is used. It is also possible to use pools of short physical nucleic acid fragments. Using 8-mers or 11-mers like 5′ (A, T, C, G) N₈ (A, T, C, G) 3′ for sequencing 1 megabase, instead of needing 20,000 fragments of 50 bp, only 2,100 samples are sufficient. This number consists of 700 random 7 kb clones (basic library), 1250 pools of 20 clones of 500 bp (subfragments ordering library) and 150 clones from a jumping (or similar) library. The developed algorithm (see Example 18) regenerates sequence using hybridization data of these described samples.

EXAMPLE 17 Hybridization with Oligonucleotides

[0099] Oligonucleotides were either purchased from Genosys Inc., Houston, Tex. or made on an Applied Biosystems 381A DNA synthesizer. Most of the probes used were not purified by HPLC or gel electrophoresis. For example, probes were designed to have both a single perfectly complementary target in interferon, an M13 clone containing a 921 bp Eco RI-Bgl II human B1-interferon fragment [Ohno and Taniguchi, Proc. Natl. Acad. Sci. 74: 4370-4374 (1981)], and at least one target with an end base mismatch in M13 vector itself.

[0100] End labelling of oligonucleotides was performed as described [Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. (1982)] in 10 μl containing T4-polynucleotide kinase (5 units, Amersham), γ-³²P-ATP (3.3 pM, 10 μCi, Amersham, 3000 Ci/mM) and oligonucleotide (4 pM, 10 ng). Specific activities of the probes were 2.5-5×10⁹ cpm/nM.

[0101] Single stranded DNA (2 to 4 μl in 0.5 M NaOH, 1.5 M NaCl) was spotted on a Gene Screen membrane wetted with the same solution; the filters were neutralized in 0.05 M Na₂HPO₄ pH 6.5, baked in an oven at 80° C. for 60 min. and UV irradiated for 1 min. Then, the filters were incubated in hybridization solution (0.5 M Na₂HPO₄ pH 7.2, 7% sodium lauroyl sarcosine) for 5 min at room temperature and placed on the surface of a plastic Petri dish. A drop of hybridization solution (10 μl, 0.5 M Na₂HPO₄ pH 7.2, 7% sodium lauroyl sarcosine) with a ³²P end labelled oligomer probe at 4 nM concentration was placed over 1-6 dots per filter, overlaid with a square piece of polyethylene (approximately 1×1 cm), and incubated in a moist chamber at the indicated temperatures for 3 hr. Hybridization was stopped by placing the filter in 6× SSC washing solution for 3×5 minutes at 0° C. to remove unhybridized probe. The filter was either dried, or further washed for the indicated times and temperatures, and autoradiographed. For discrimination measurements, the dots were excised from the dried filters after autoradiography [a phosphoimager (Molecular Dynamics, Sunnyvale, Calif.) may be used], placed in liquid scintillation cocktail and counted. The uncorrected ratio of cpms for IF and M13 dots is given as D.

[0102] The conditions reported herein allow hybridization with very short oligonucleotides but ensure discrimination between matched and mismatched oligonucleotides that are complementary to and therefore bind to a target nucleic acid. Factors which influence the efficient detection of hybridization of specific short sequences based on the degree of discrimination (D) between a perfectly complementary target and an imperfectly complementary target with a single mismatch in the hybrid are defined. In experimental tests, dot blot hybridization of twenty-eight probes that were 6 to 8 nucleotides in length to two M13 clones or to model oligonucleotides bound to membrane filters was accomplished. The principles guiding the experimental procedures are given below.

[0103] Oligonucleotide hybridization to filter bound target nucleic acids only a few nucleotides longer than the probe in conditions of probe excess is a pseudo-first order reaction with respect to target concentration. This reaction is defined by:

$$S_t / S_o = e^{-k_h [OP] t}$$

[0104] Wherein S_(t) and S_(o) are target sequence concentrations at time t and t₀, respectively, [OP] is probe concentration and t is time. The rate constant for hybrid formation, k_(h), increases only slightly in the 0° C. to 30° C. range [Porschke and Eigen, J. Mol. Biol. 62: 361 (1971); Craig et al., J. Mol. Biol. 62: 383 (1971)]. Hybrid melting is a first order reaction with respect to hybrid concentration (here replaced by mass due to filter bound state) as shown in:

$$H_t / H_o = e^{-k_m t}$$

[0105] In this equation, H_(t) and H_(o) are hybrid concentrations at times t and t_(o), respectively; k_(m) is a rate constant for hybrid melting which is dependent on temperature and salt concentration [Ikuta et al., Nucl. Acids Res. 15: 797 (1987); Porschke and Eigen, J. Mol. Biol. 62: 361 (1971); Craig et al., J. Mol. Biol. 62: 383 (1971)]. During hybridization, which is a strand association process, the back reaction (melting, or strand dissociation) takes place as well. Thus, the amount of hybrid formed in time is the result of forward and back reactions. The equilibrium may be moved towards hybrid formation by increasing probe concentration and/or decreasing temperature. However, during washing cycles in large volumes of buffer, the melting reaction is dominant and the back reaction hybridization is insignificant, since the probe is absent. This analysis indicates that workable Short Oligonucleotide Hybridization (SOH) conditions can be varied for probe concentration or temperature.
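The two rate laws may be evaluated numerically as in the sketch below. The rate constants are hypothetical values chosen only so that the arithmetic is easy to follow (they are not measurements from this work), and each function treats its reaction in isolation, ignoring the opposing reaction discussed above.

```python
import math


def target_fraction_unhybridized(k_h: float, probe_conc: float,
                                 time: float) -> float:
    """S_t / S_o for pseudo-first-order hybrid formation in probe excess."""
    return math.exp(-k_h * probe_conc * time)


def hybrid_fraction_remaining(k_m: float, time: float) -> float:
    """H_t / H_o for first-order hybrid melting during washing."""
    return math.exp(-k_m * time)


# Hypothetical constants: half of the target hybridizes in 1 h at 4 nM probe,
# and the hybrid half-life during washing is 10 min.
probe_conc = 4e-9                             # mol/l
k_h = math.log(2) / (probe_conc * 3600)       # l/(mol*s)
k_m = math.log(2) / 600                       # 1/s

free_after_3h = target_fraction_unhybridized(k_h, probe_conc, 3 * 3600)
left_after_20min = hybrid_fraction_remaining(k_m, 20 * 60)
print(free_after_3h, left_after_20min)
```

With these illustrative constants, three half-times of hybridization leave about one eighth of the target free, and two half-lives of washing leave about one quarter of the hybrid.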

[0106] D or discrimination is defined in equation four:

$$D = H_p(t_w) / H_i(t_w)$$

[0107] H_(p) (t_(w)) and H_(i) (t_(w)) are the amounts of hybrids remaining after a washing time, t_(w), for the identical amounts of perfectly and imperfectly complementary duplex, respectively. For a given temperature, the discrimination D changes with the length of washing time and reaches the maximal value when H_(i)=B, which is equation five.

[0108] The background, B, represents the lowest hybridization signal detectable in the system. Since any further decrease of H_(i) below B may not be examined, D increases upon continued washing only until H_(i) reaches B. Washing past t_(w) just decreases H_(p) relative to B, and is seen as a decrease in D. The optimal washing time, t_(w), for imperfect hybrids, from equation three and equation five is:

$$t_w = -\frac{\ln\left(B / H_i(t_0)\right)}{k_{m,i}}$$

[0109] Since H_(p) is being washed for the same t_(w), combining equations, one obtains the optimal discrimination function:

$$D = e^{\ln\left(B / H_i(t_0)\right)\, k_{m,p} / k_{m,i}} \times \frac{H_p(t_0)}{B}$$
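The step from the optimal washing time to the discrimination function can be written out as a short derivation. This is a sketch that assumes only the first order melting law and the condition H_(i)(t_(w))=B stated in the text; the symbols are the ones already defined there.

```latex
\begin{align}
H_i(t_w) &= H_i(t_0)\, e^{-k_{m,i} t_w} = B
  \quad\Longrightarrow\quad
  t_w = -\frac{\ln\left(B / H_i(t_0)\right)}{k_{m,i}}, \\
H_p(t_w) &= H_p(t_0)\, e^{-k_{m,p} t_w}
  = H_p(t_0)\, e^{\ln\left(B / H_i(t_0)\right) k_{m,p} / k_{m,i}}, \\
D &= \frac{H_p(t_w)}{H_i(t_w)}
  = \frac{H_p(t_0)}{B} \left( \frac{B}{H_i(t_0)} \right)^{k_{m,p} / k_{m,i}} .
\end{align}
```

Because B/H_(i)(t₀) is less than one, a smaller ratio k_(m,p)/k_(m,i) gives a larger D.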

[0110] The change of D as a function of T is important for the choice of an optimal washing temperature. It is obtained by substituting the Arrhenius equation, which is:

$$k_m = A e^{-E_\alpha / RT}$$

[0111] into the previous equation to form the final equation:

$$D = \frac{H_p(t_0)}{B} \times \left( \frac{B}{H_i(t_0)} \right)^{(A_p / A_i)\, e^{(E_{\alpha,i} - E_{\alpha,p}) / RT}}$$

[0112] Wherein B is less than H_(i) (t₀).

[0113] Since the activation energy for perfect hybrids, E_(α,p), and the activation energy for imperfect hybrids, E_(α,i), can either be equal, or E_(α,i) can be less than E_(α,p), D is either temperature independent or decreases with increasing temperature, respectively. This result implies that the search for stringent temperature conditions for good discrimination in SOH is unjustified. By washing at lower temperatures, one obtains equal or better discrimination, but the time of washing increases exponentially with the decrease of temperature. Discrimination decreases more strongly with T if H_(i)(t₀) increases relative to H_(p)(t₀).
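The temperature behaviour described above can be checked numerically. The following is a minimal sketch of the model, not a calibrated simulation: the pre-exponential factors, activation energies, and starting hybrid amounts are hypothetical values chosen only for illustration, and the function names are not from the source.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol*K)


def melting_rate(A, Ea, T):
    """Arrhenius rate constant k_m = A * exp(-Ea / (R * T))."""
    return A * math.exp(-Ea / (R * T))


def optimal_wash(Hp0, Hi0, B, Ap, Eap, Ai, Eai, T):
    """Return (t_w, D) when washing stops as the imperfect hybrid reaches B.

    Hp0, Hi0: perfect and imperfect hybrid amounts at the start of washing.
    B: background level; the model requires B < Hi0.
    """
    if not B < Hi0:
        raise ValueError("model assumes B < H_i(t0)")
    k_p = melting_rate(Ap, Eap, T)
    k_i = melting_rate(Ai, Eai, T)
    t_w = -math.log(B / Hi0) / k_i
    D = (Hp0 / B) * (B / Hi0) ** (k_p / k_i)
    return t_w, D


if __name__ == "__main__":
    # Hypothetical parameters: A in 1/min, Ea in kcal/mol, amounts in arbitrary units.
    for celsius in (7.0, 13.0, 25.0):
        t_w, D = optimal_wash(10.0, 40.0, 1.0, 1e32, 45.0, 1e32, 43.0,
                              celsius + 273.15)
        print(f"{celsius:5.1f} C  t_w = {t_w:10.2f} min  D = {D:6.3f}")
```

With E_(α,i) less than E_(α,p), the computed D rises and the washing time lengthens as the temperature falls; with equal activation energies and equal pre-exponential factors, D reduces to H_(p)(t₀)/H_(i)(t₀) at every temperature.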

[0114] D at lower temperatures depends to a higher degree on the H_(p)(t₀)/B ratio than on the H_(p)(t₀)/H_(i)(t₀) ratio. This result indicates that it is better to obtain a sufficient quantity of H_(p) in the hybridization regardless of the discrimination that can be achieved in this step. Better discrimination can then be obtained by washing, since the higher amounts of perfect hybrid allow more time for differential melting to show an effect. Similarly, by using larger amounts of target nucleic acid, the necessary discrimination can be obtained even with small differences between k_(m,p) and k_(m,i).

[0115] Extrapolated to a more complex situation than covered in this simple model, the result is that washing at lower temperatures is even more important for obtaining discrimination in the case of hybridization of a probe having many end-mismatches within a given nucleic acid target.

[0116] Using the described theoretical principles as a guide for experiments, reliable hybridizations have been obtained with probes six to eight nucleotides in length. All experiments were performed with a floating plastic sheet providing a film of hybridization solution above the filter. This procedure allows maximal reduction in the amount of probe, and thus reduced label costs in dot blot hybridizations. The high concentration of sodium lauroyl sarcosine instead of sodium lauroyl sulfate in the phosphate hybridization buffer allows dropping the reaction temperature from room temperature down to 12° C. Similarly, the 4-6× SSC, 10% sodium lauroyl sarcosine buffer allows hybridization at temperatures as low as 2° C. The detergent in these buffers is for obtaining tolerable background with up to 40 nM concentrations of labelled probe. Preliminary characterization of the thermal stability of short oligonucleotide hybrids was performed on a prototype octamer with 50% G+C content, i.e. a probe of sequence TGCTCATG. The theoretical expectation is that this probe is among the less stable octamers. Its transition enthalpy is similar to those of more stable heptamers or even to those of probes 6 nucleotides in length [Breslauer et al., Proc. Natl. Acad. Sci. U.S.A. 83: 3746 (1986)]. The parameter T_(d), the temperature at which 50% of the hybrid is melted in a unit time of one minute, is 18° C. The result shows that T_(d) is 15° C. lower for the 8 bp hybrid than for an 11 bp duplex [Wallace et al., Nucleic Acids Res. 6: 3543 (1979)].

[0117] In addition to experiments with model oligonucleotides, an M13 vector was chosen as a system for a practical demonstration of short oligonucleotide hybridization. The main aim was to show useful end-mismatch discrimination with a target similar to the ones which will be used in various applications of the method of the invention. Oligonucleotide probes for the M13 model were chosen in such a way that the M13 vector itself contains the end mismatched base. Vector IF, an M13 recombinant containing a 921 bp human interferon gene insert, carries a single perfectly matched target. Thus, IF has either the identical or a higher number of mismatched targets in comparison to the M13 vector itself.

[0118] Using low temperature conditions and dot blots, sufficient differences in hybridization signals were obtained between the dot containing the perfect and the mismatched targets and the dot containing the mismatched targets only. This was true for the 6-mer oligonucleotides and was also true for the 7-mer and 8-mer oligonucleotides hybridized to the large IF-M13 pair of nucleic acids.

[0119] The hybridization signal depends on the amount of target available on the filter for reaction with the probe. A necessary control is to show that the difference in signal intensity is not a reflection of varying amounts of nucleic acid in the two dots. Hybridization with a probe that has the same number and kind of targets in both IF and M13 shows that there is an equal amount of DNA in the dots. Since the efficiency of hybrid formation increases with hybrid length, the signal for a duplex having six nucleotides was best detected with a high mass of oligonucleotide target bound to the filter. Due to their lower molecular weight, a larger number of oligonucleotide target molecules can be bound to a given surface area when compared to large molecules of nucleic acid that serve as target.

[0120] To measure the sensitivity of detection with unpurified DNA, various amounts of phage supernatants were spotted on the filter and hybridized with a ³²P-labelled octamer. As little as 50 million unpurified phage containing no more than 0.5 ng of DNA gave a detectable signal, indicating that the sensitivity of the short oligonucleotide hybridization method is sufficient. Reaction time is short, adding to the practicality.

[0121] As mentioned in the theoretical section above, the equilibrium yield of hybrid depends on probe concentration and/or temperature of reaction. For instance, the signal level for the same amount of target with 4 nM octamer at 13° C. is 3 times lower than with a probe concentration of 40 nM, and is decreased 4.5-times by raising the hybridization temperature to 25° C.

[0122] The utility of the low temperature wash for achieving maximal discrimination is demonstrated. To make the phenomenon visually obvious, 50 times more DNA was put in the M13 dot than in the IF dot, using hybridization with a vector specific probe. In this way, the signal after the hybridization step with the actual probe was made stronger in the mismatched than in the matched case. The H_(p)/H_(i) ratio was 1:4. Inversion of signal intensities after prolonged washing at 7° C. was achieved without a massive loss of perfect hybrid, resulting in a ratio of 2:1. In contrast, it is impossible to achieve any discrimination at 25° C., since the matched target signal is already brought down to the background level with 2 minute washing; at the same time, the signal from the mismatched hybrid is still detectable. The loss of discrimination at 13° C. compared to 7° C. is not so great but is clearly visible. If one considers the 90 minute point at 7° C. and the 15 minute point at 13° C., when the mismatched hybrid signal is near the background level, which represent optimal washing times for the respective conditions, it is obvious that the amount of perfect hybrid is several times greater at 7° C. than at 13° C. To illustrate this further, the time course of the change in discrimination with washing of the same amount of starting hybrid at the two temperatures shows the higher maximal D at the lower temperature. These results confirm the trend in the change of D with temperature and the ratio of amounts of the two types of hybrid at the start of the washing step.

[0123] In order to show the general utility of the short oligonucleotide hybridization conditions, we have looked at hybridization of 4 heptamers, 10 octamers and an additional 14 probes up to 12 nucleotides in length in our simple M13 system. These include the nonamer GTTTTTTAA and octamer GGCAGGCG, representing the two extremes of GC content. Although GC content and sequence are expected to influence the stability of short hybrids [Breslauer et al., Proc. Natl. Acad. Sci. U.S.A. 83: 3746 (1986)], the low temperature short oligonucleotide conditions were applicable to all tested probes in achieving sufficient discrimination. Since the best discrimination value obtained with probes 13 nucleotides in length was 20, a several fold drop due to sequence variation is easily tolerated.

[0124] The M13 system has the advantage of showing the effects of target DNA complexity on the levels of discrimination. For two octamers having either none or five mismatched targets and differing in only one GC pair, the observed discriminations were 18.3 and 1.7, respectively.

[0125] In order to show the utility of this method, three probes 8 nucleotides in length were tested on a collection of 51 plasmid DNA dots made from a library in Bluescript vector. One probe had a target that was present in and specific for the Bluescript vector but absent in M13, while the other two probes had targets that were inserts of known sequence. This system allowed the use of hybridization negative or positive control DNAs with each probe. This probe sequence (CTCCCTTT) also had a complementary target in the interferon insert. Since the M13 dot is negative while the interferon insert in either M13 or Bluescript was positive, the hybridization is sequence specific. Similarly, probes were tested that detect the target sequence in only one of 51 inserts, or in none of the examined inserts, along with controls that confirm that hybridization would have occurred if the appropriate targets were present in the clones.

[0126] Thermal stability curves for very short oligonucleotide hybrids that are 6-8 nucleotides in length are at least 15° C. lower than for hybrids 11-12 nucleotides in length [FIG. 1 and Wallace et al., Nucleic Acids Res. 6: 3543-3557 (1979)]. However, performing the hybridization reaction at a low temperature and with a very practical 0.4-40 nM concentration of oligonucleotide probe allows the detection of complementary sequence in a known or unknown nucleic acid target. To determine an unknown nucleic acid sequence completely, an entire set containing 65,536 (4⁸) 8-mer probes may be used. Sufficient amounts of nucleic acid for this purpose are present in convenient biological samples such as a few microliters of M13 culture, a plasmid prep from 10 ml of bacterial culture or a single colony of bacteria, or less than 1 μl of a standard PCR reaction.

[0127] Short oligonucleotides 6-10 nucleotides long give excellent discrimination. The relative decrease in hybrid stability with a single end mismatch is greater than for longer probes. Results with the octamer TGCTCATG support this conclusion. In the experiments, the mismatched target had a G/T end mismatch; hybridization to a target with this type of mismatch is the most stable of all mismatch types. The discrimination achieved is the same as or greater than that for an internal G/T mismatch in a 19 base pair duplex [Ikuta et al., Nucl. Acids Res. 15: 797 (1987)]. Exploiting these discrimination properties using the described hybridization conditions for short oligonucleotide hybridization allows a very precise determination of oligonucleotide targets.

[0128] In contrast to the ease of detecting discrimination between perfect and imperfect hybrids, a problem that may exist with using very short oligonucleotides is the preparation of sufficient amounts of hybrids. In practice, the need to discriminate H_(p) and H_(i) is aided by increasing the amount of DNA in the dot and/or the probe concentration, or by decreasing the hybridization temperature. However, higher probe concentrations usually increase background. Moreover, there are limits to the amounts of target nucleic acid that are practical to use. This problem was solved by the higher concentration of the detergent Sarcosyl, which gave an effective background with 4 nM of probe. Further improvements may be effected either in the use of competitors for unspecific binding of probe to filter, or by changing the hybridization support material. Moreover, for probes having E_(α) less than 45 Kcal/mol (e.g. for many heptamers and a majority of hexamers), modified oligonucleotides give a more stable hybrid [Asseline, et al., Proc. Nat'l Acad. Sci. 81: 3297 (1984)] than their unmodified counterparts. The hybridization conditions described in this invention for short oligonucleotide hybridization using low temperatures give better discrimination for all sequences and duplex hybrid inputs. The only price paid in achieving uniformity in hybridization conditions for different sequences is an increase in washing time from minutes to up to 24 hours depending on the sequence. Moreover, the washing time can be further reduced by decreasing the salt concentration.

[0129] Although there is excellent discrimination of one matched hybrid over mismatched hybrids in short oligonucleotide hybridization, signals from mismatched hybrids exist, with the majority of the mismatched hybrids resulting from end mismatch. This may limit insert sizes that may be effectively examined by a probe of a certain length.

[0130] The influence of sequence complexity on discrimination cannot be ignored. However, the complexity effects are more significant when defining sequence information by short oligonucleotide hybridization for specific, nonrandom sequences, and can be overcome by using an appropriate probe to target length ratio. The length ratio is chosen to make unlikely, on statistical grounds, the occurrence of specific sequences which have a number of end-mismatches which would be able to eliminate or falsely invert discrimination. Results suggest the use of oligonucleotides 6, 7, and 8 nucleotides in length on target nucleic acid inserts shorter than 0.6, 2.5, and 10 kb, respectively.
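One way to see why these probe and target lengths pair up is to estimate how often a given probe sequence would occur by chance. The sketch below is a simplification that is not taken from the source: it assumes a random single-stranded target with equal base frequencies and counts exact occurrences only, ignoring end-mismatched sites; the function name is hypothetical.

```python
def expected_chance_matches(k, target_length):
    """Expected occurrences of one specific k-mer in a random target.

    Assumes independent positions and equal frequencies of the four bases.
    """
    positions = max(target_length - k + 1, 0)  # number of k-mer frames
    return positions / 4 ** k


if __name__ == "__main__":
    # Probe length and suggested maximal insert size (in bases) from the text.
    for k, length in [(6, 600), (7, 2500), (8, 10000)]:
        print(k, length, round(expected_chance_matches(k, length), 3))
```

Under these assumptions the three suggested pairings all give an expectation of roughly 0.15 chance occurrences per probe, which is consistent with choosing the length ratio on statistical grounds.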

EXAMPLE 18 Sequencing a Target Using Octamers and Nonamers

[0131] In this example, hybridization conditions that were used are described supra in Example 17. Data resulting from the hybridization of octamer and nonamer oligonucleotides shows that sequencing by hybridization provides an extremely high degree of accuracy. In this experiment, a known sequence was used to predict a series of contiguous overlapping component octamer and nonamer oligonucleotides.

[0132] In addition to the perfectly matching oligonucleotides, mismatch oligonucleotides wherein internal or end mismatches occur in the duplex formed by the oligonucleotide and the target were examined. In these analyses, the lowest practical temperature was used to maximize hybrid formation. Washes were accomplished at the same or lower temperatures to ensure maximal discrimination by utilizing the greater dissociation rate of mismatched versus matched oligonucleotide/target hybrids. These conditions are shown to be applicable to all sequences, although the absolute hybridization yield is shown to be sequence dependent.

[0133] The least destabilizing mismatch that can be postulated is a simple end mismatch, so that the test of sequencing by hybridization is the ability to discriminate perfectly matched oligonucleotide/target duplexes from end-mismatched oligonucleotide/target duplexes.

[0134] The discriminative values for 102 of 105 hybridizing oligonucleotides in a dot blot format were greater than 2, allowing a highly accurate generation of the sequence. This system also allowed an analysis of the effect of sequence on hybridization formation and hybridization instability.

[0135] The sequence of one hundred base pairs of a known portion of a human β-interferon gene prepared by PCR, i.e. a 100 bp target sequence, was generated with data resulting from the hybridization of 105 oligonucleotide probes of known sequence to the target nucleic acid. The oligonucleotide probes used included 72 octamer and 21 nonamer oligonucleotides whose sequence was perfectly complementary to the target. The set of 93 probes provided consecutive overlapping frames of the target sequence displaced by one or two bases.
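The idea of consecutive overlapping frames can be illustrated with a short sketch. It is not the authors' probe design procedure: the function names and the example target are hypothetical, and only fixed-length frames are generated, whereas the experiment mixed octamers and nonamers.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def overlapping_frames(target, k, step=1):
    """List the k-base frames of `target`, each displaced by `step` bases."""
    return [target[i:i + k] for i in range(0, len(target) - k + 1, step)]


def probes_for(target, k, step=1):
    """Probes perfectly complementary to each frame of the target strand."""
    return [reverse_complement(f) for f in overlapping_frames(target, k, step)]


if __name__ == "__main__":
    target = "ACGT" * 25  # hypothetical 100-base target, not the real gene
    frames = overlapping_frames(target, 8)
    print(len(frames))               # number of octamer frames at step 1
    print(probes_for(target, 8)[:3])  # first three complementary probes
```

A 100-base target contains 100 − 8 + 1 = 93 octamer frames when each frame is displaced by one base.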

[0136] To evaluate the effect of mismatches, hybridization was examined for 12 additional probes that contained at least one end mismatch when hybridized to the 100 bp test target sequence. Also tested was the hybridization of twelve probes with target end-mismatched to four other control nucleic acid sequences chosen so that the 12 oligonucleotides formed perfectly matched duplex hybrids with the four control DNAs. Thus, the hybridization of internal mismatched, end-mismatched and perfectly matched duplex pairs of oligonucleotide and target were evaluated for each oligonucleotide used in the experiment. The effect of absolute DNA target concentration on the hybridization with the test octamer and nonamer oligonucleotides was determined by defining target DNA concentration by detecting hybridization of a different oligonucleotide probe to a single occurrence non-target site within the co-amplified plasmid DNA.

[0137] The results of this experiment showed that all oligonucleotides containing perfect matching complementary sequence to the target or control DNA hybridized more strongly than those oligonucleotides having mismatches. To come to this conclusion, we examined H_(p) and D values for each probe. H_(p) defines the amount of hybrid duplex formed between a test target and an oligonucleotide probe. By assigning values of between 0 and 10 to the hybridization obtained for the 105 probes, it was apparent that 68.5% of the 105 probes had an H_(p) greater than 2.

[0138] Discrimination (D) values were obtained where D was defined as the ratio of signal intensities between 1) the dot containing a perfect matched duplex formed between test oligonucleotide and target or control nucleic acid and 2) the dot containing a mismatch duplex formed between the same oligonucleotide and a different site within the target or control nucleic acid. Variations in the value of D result from either 1) perturbations in the hybridization efficiency which allows visualization of signal over background, or 2) the type of mismatch found between the test oligonucleotide and the target. The D values obtained in this experiment were between 2 and 40 for 102 of the 105 oligonucleotide probes examined. Calculations of D for the group of 102 oligonucleotides as a whole showed the average D was 10.6.

[0139] There were 20 cases where oligonucleotide/target duplexes exhibited an end-mismatch. In five of these, D was greater than 10. The large D value in these cases is most likely due to hybridization destabilization caused by other than the most stable (G/T and G/A) end mismatches. The other possibility is there was an error in the sequence of either the oligonucleotides or the target.

[0140] Error in the target for probes with low H_(p) was excluded as a possibility because such an error would have affected the hybridization of each of the other eight overlapping oligonucleotides. There was no apparent instability due to sequence mismatch for the other overlapping oligonucleotides, indicating the target sequence was correct. Error in the oligonucleotide sequence was excluded as a possibility after the hybridization of seven newly synthesized oligonucleotides was re-examined. Only 1 of the seven oligonucleotides resulted in a better D value. Low hybrid formation values may result from hybrid instability or from an inability to form hybrid duplex. An inability to form hybrid duplexes would result from either 1) self complementarity of the chosen probe or 2) target/target self hybridization. Oligonucleotide/oligonucleotide duplex formation may be favored over oligonucleotide/target hybrid duplex formation if the probe was self-complementary. Similarly, target/target association may be favored if the target was self-complementary or may form internal palindromes. In evaluating these possibilities, it was apparent from probe analysis that the questionable probes did not form hybrids with themselves. Moreover, in examining the contribution of target/target hybridization, it was determined that one of the questionable oligonucleotide probes hybridized inefficiently with two different DNAs containing the same target. The low probability that two different DNAs have a self-complementary region for the same target sequence leads to the conclusion that target/target hybridization did not contribute to low hybridization formation. Thus, these results indicate that hybrid instability and not the inability to form hybrids was the cause of the low hybrid formation observed for specific oligonucleotides. The results also indicate that low hybrid formation is due to the specific sequences of certain oligonucleotides. Moreover, the results indicate that reliable results may be obtained to generate sequences if octamer and nonamer oligonucleotides are used.

[0141] These results show that, using the methods described, long sequences of any specific target nucleic acid may be generated by maximal and unique overlap of constituent oligonucleotides. Such sequencing methods are dependent on the content of the individual component oligomers regardless of their frequency and their position.

[0142] The sequence which is generated using the algorithm described below is of high fidelity. The algorithm tolerates false positive signals from the hybridization dots, as is indicated by the fact that the sequence generated from the 105 hybridization values, which included four less reliable values, was correct. This fidelity in sequencing by hybridization is due to the “all or none” kinetics of short oligonucleotide hybridization and the difference in duplex stability that exists between perfectly matched duplexes and mismatched duplexes. The ratio of duplex stability of matched and end-mismatched duplexes increases with decreasing duplex length. Moreover, binding energy decreases with decreasing duplex length, resulting in a lower hybridization efficiency. However, the results provided show that octamer hybridization allows the balancing of the factors affecting duplex stability and discrimination to produce a highly accurate method of sequencing by hybridization. Results presented in other examples show that oligonucleotides that are 6, 7, or 8 nucleotides in length can be effectively used to generate reliable sequence on targets that are 0.5 kb (for hexamers), 2 kb (for heptamers) and 6 kb (for octamers). The sequence of long fragments may be overlapped to generate a complete genome sequence.

[0143] An algorithm to determine sequence by hybridization is described in Example 19.

EXAMPLE 19 Algorithm

[0144] This example describes an algorithm for generation of a long sequence written in a four letter alphabet from constituent k-tuple words in a minimal number of separate, randomly defined fragments of a starting nucleic acid sequence, where k is the length of an oligonucleotide probe. The algorithm is primarily intended for use in the sequencing by hybridization (SBH) process. The algorithm is based on subfragments (SF), informative fragments (IF) and the possibility of using pools of physical nucleic sequences for defining informative fragments.

[0145] As described, subfragments may be caused by branch points in the assembly process resulting from the repetition of a K−1 oligomer sequence in a target nucleic acid. Subfragments are sequence fragments found between any two repetitive words of the length K−1 that occur in a sequence. Multiple occurrences of K−1 words are the cause of interruption of ordering the overlap of K-words in the process of sequence generation. Interruption leads to a sequence remaining in the form of subfragments. Thus, the unambiguous segments between branching points whose order is not uniquely determined are called sequence subfragments.
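Branch points can be located in a known test sequence by counting its K−1 words. The following is a minimal sketch for simulation purposes; the function names and the example sequence are hypothetical and not from the source.

```python
from collections import Counter


def repeated_words(sequence, k):
    """Return the set of (k-1)-base words occurring more than once."""
    w = k - 1
    counts = Counter(sequence[i:i + w] for i in range(len(sequence) - w + 1))
    return {word for word, n in counts.items() if n > 1}


def branch_positions(sequence, k):
    """Map each repeated (k-1)-base word to the positions where it starts."""
    w = k - 1
    repeats = repeated_words(sequence, k)
    positions = {word: [] for word in repeats}
    for i in range(len(sequence) - w + 1):
        word = sequence[i:i + w]
        if word in repeats:
            positions[word].append(i)
    return positions


if __name__ == "__main__":
    seq = "ACGTTACGGA"  # hypothetical example; ACG occurs twice
    print(branch_positions(seq, 4))
    print(repeated_words(seq, 5))
```

In this example the word ACG is repeated, so probes of length 4 leave the sequence in two subfragments, while probes of length 5 encounter no repeated 4-base word and no branch point arises.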

[0146] Informative fragments are defined as fragments of a sequence that are determined by the nearest ends of overlapped physical sequence fragments.

[0147] A certain number of physical fragments may be pooled without losing the possibility of defining informative fragments. The total length of randomly pooled fragments depends on the length of k-tuples that are used in the sequencing process.

[0148] The algorithm consists of two main units. The first part is used for generation of subfragments from the set of k-tuples contained in a sequence. Subfragments may be generated within the coding region of physical nucleic acid sequence of certain sizes, or within the informative fragments defined within long nucleic acid sequences. Both types of fragments are members of the basic library. This algorithm does not describe the determination of the content of the k-tuples of the informative fragments of the basic library, i.e. the step of preparation of informative fragments to be used in the sequence generation process.

[0149] The second part of the algorithm determines the linear order of obtained subfragments with the purpose of regenerating the complete sequence of the nucleic acid fragments of the basic library. For this purpose a second, ordering library is used, made of randomly pooled fragments of the starting sequence. The algorithm does not include the step of combining sequences of basic fragments to regenerate an entire, megabase plus sequence. This may be accomplished using the link-up of fragments of the basic library which is a prerequisite for informative fragment generation. Alternatively, it may be accomplished after generation of sequences of fragments of the basic library by this algorithm, using search for their overlap, based on the presence of common end-sequences.

[0150] The algorithm requires neither knowledge of the number of appearances of a given k-tuple in a nucleic acid sequence of the basic and ordering libraries, nor information on which k-tuple words are present on the ends of a fragment. The algorithm operates with a mixed content of k-tuples of various lengths. The concept of the algorithm enables operations with k-tuple sets that contain false positive and false negative k-tuples. Only in specific cases does the content of the false k-tuples primarily influence the completeness and correctness of the generated sequence. The algorithm may be used for optimization of parameters in simulation experiments, as well as for sequence generation in actual SBH experiments, e.g. generation of the genomic DNA sequence. In optimization of parameters, the choice of the oligonucleotide probes (k-tuples) for practical and convenient fragments and/or the choice of the optimal lengths and the number of fragments for the defined probes are especially important.

[0151] This part of the algorithm has a central role in the process of the generation of the sequence from the content of k-tuples. It is based on the unique ordering of k-tuples by means of maximal overlap. The main obstacles in sequence generation are specific repeated sequences and false positive and/or negative k-tuples. The aim of this part of the algorithm is to obtain the minimal number of the longest possible subfragments, with correct sequence. This part of the algorithm consists of one basic, and several control steps. A two-stage process is necessary since certain information can be used only after generation of all primary subfragments (pSFs).

[0152] The main problem of sequence generation is obtaining a repeated sequence from word contents that by definition do not carry information on the number of occurrences of the particular k-tuples. The concept of the entire algorithm depends on the basis on which this problem is solved. In principle, there are two opposite approaches: 1) repeated sequences may be obtained at the beginning, in the process of generation of pSFs, or 2) repeated sequences can be obtained later, in the process of the final ordering of the subfragments. In the first case, pSFs contain an excess of sequences and in the second case, they contain a deficit of sequences. The first approach requires elimination of the excess sequences generated; and the second requires permitting multiple use of some of the subfragments in the process of the final assembling of the sequence.

[0153] The difference in the two approaches is in the degree of strictness of the rule of unique overlap of k-tuples. The less severe rule is: k-tuple X is unambiguously maximally overlapped with k-tuple Y if and only if the rightmost k−1 end of k-tuple X is present only on the leftmost end of k-tuple Y. This rule allows the generation of repetitive sequences and the formation of surplus sequences.

[0154] A stricter rule, which is used in the second approach, has an additional caveat: k-tuple X is unambiguously maximally overlapped with k-tuple Y if and only if the rightmost K−1 end of k-tuple X is present only on the leftmost end of k-tuple Y and if the leftmost K−1 end of k-tuple Y is not present on the rightmost end of any other k-tuple. The algorithm based on the stricter rule is simpler, and is described herein.
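The stricter rule can be expressed as a small predicate. This is a sketch under the assumption that all k-tuples have the same length and that X and Y are members of the set; the function name and the example words are hypothetical.

```python
def unambiguous_overlap(x, y, ktuples):
    """Strict rule: X is unambiguously maximally overlapped with Y.

    True only if the rightmost k-1 bases of x start y and no other k-tuple,
    and the leftmost k-1 bases of y end x and no other k-tuple.
    """
    pool = set(ktuples)
    shared = x[1:]  # rightmost k-1 bases of x
    if y[:-1] != shared:
        return False
    followers = {t for t in pool if t[:-1] == shared}  # k-tuples starting with it
    leaders = {t for t in pool if t[1:] == shared}     # k-tuples ending with it
    return followers == {y} and leaders == {x}


if __name__ == "__main__":
    pool = {"ACGT", "CGTA", "CGTC", "GTAA"}
    print(unambiguous_overlap("ACGT", "CGTA", pool))             # CGT starts two words
    print(unambiguous_overlap("CGTA", "GTAA", pool))             # unique on both sides
    print(unambiguous_overlap("CGTA", "GTAA", pool | {"TGTA"}))  # GTA ends two words
```

The third call shows the extra condition of the stricter rule: GTAA is still the only word starting with GTA, but a second word ending in GTA makes the overlap ambiguous.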

[0155] The process of elongation of a given subfragment is stopped when the right k−1 end of the last k-tuple included is not present on the left end of any k-tuple or is present on two or more k-tuples. If it is present on only one k-tuple, the second part of the rule is tested. If in addition there is a k-tuple, different from the previously included one, whose right k−1 end is the same, the assembly of the given subfragment is terminated only on the first leftmost position. If this additional k-tuple does not exist, the conditions are met for unique k−1 overlap and a given subfragment is extended to the right by one element.

[0156] Beside the basic rule, a supplementary one is used to allow the usage of k-tuples of different lengths. The maximal overlap is the length of k−1 of the shorter k-tuple of the overlapping pair. Generation of the pSFs is performed starting from the first k-tuple from the file in which k-tuples are displayed randomly and independently from their order in a nucleic acid sequence. Thus, the first k-tuple in the file is not necessarily on the beginning of the sequence, nor on the start of the particular subfragment. The process of subfragment generation is performed by ordering the k-tuples by means of unique overlap, which is defined by the described rule. Each used k-tuple is erased from the file. At the point when there are no further k-tuples unambiguously overlapping with the last one included, the building of the subfragment is terminated and the buildup of another pSF is started. Since generation of a majority of subfragments does not begin from their actual starts, the formed pSFs are added to the k-tuple file and are considered as a longer k-tuple. Another possibility is to form subfragments going in both directions from the starting k-tuple. The process ends when further overlap, i.e. the extension of any of the subfragments, is not possible.
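The generation of pSFs can be sketched as follows. This is an illustrative simplification rather than the implementation described in the source: it assumes an exact set of equal-length k-tuples (no false positives or negatives), applies the stricter overlap rule, and uses the bidirectional variant of extending each subfragment in both directions from its starting k-tuple. All names are hypothetical.

```python
def kmers(sequence, k):
    """Set of k-tuples contained in a sequence (used here for simulation)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_psfs(ktuples):
    """Assemble primary subfragments from equal-length k-tuples."""
    pool = set(ktuples)
    by_prefix, by_suffix = {}, {}
    for t in pool:
        by_prefix.setdefault(t[:-1], []).append(t)  # words starting with a k-1 word
        by_suffix.setdefault(t[1:], []).append(t)   # words ending with a k-1 word

    def successor(x):
        # Unique right neighbour of x under the strict rule, else None.
        shared = x[1:]
        starts = by_prefix.get(shared, [])
        if len(starts) == 1 and len(by_suffix[shared]) == 1:
            return starts[0]
        return None

    def predecessor(y):
        # Unique left neighbour of y under the strict rule, else None.
        shared = y[:-1]
        ends = by_suffix.get(shared, [])
        if len(ends) == 1 and len(by_prefix[shared]) == 1:
            return ends[0]
        return None

    unused = set(pool)
    psfs = []
    for seed in sorted(pool):  # sorted only to make the output reproducible
        if seed not in unused:
            continue
        unused.discard(seed)   # each used k-tuple is erased from the file
        chain = [seed]
        nxt = successor(seed)
        while nxt is not None and nxt in unused:
            unused.discard(nxt)
            chain.append(nxt)
            nxt = successor(nxt)
        prev = predecessor(seed)
        while prev is not None and prev in unused:
            unused.discard(prev)
            chain.insert(0, prev)
            prev = predecessor(prev)
        # Merge the chain: first word plus the last base of each later word.
        psfs.append(chain[0] + "".join(t[-1] for t in chain[1:]))
    return psfs


if __name__ == "__main__":
    seq = "ACGTTACGGA"  # hypothetical example with the 3-base word ACG repeated
    print(build_psfs(kmers(seq, 4)))
    print(build_psfs(kmers(seq, 5)))
```

With 4-tuples the repeated word ACG interrupts assembly and two pSFs result, each bounded by the repeat; with 5-tuples there is no repeated 4-base word and the single pSF is the whole sequence.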

[0157] The pSFs can be divided into three groups: 1) subfragments of the maximal length and correct sequence in cases of an exact k-tuple set; 2) short subfragments, formed due to the use of the maximal and unambiguous overlap rule on an incomplete set and/or a set with some false positive k-tuples; and 3) pSFs of an incorrect sequence. The incompleteness of the set in 2) is caused by false negative results of a hybridization experiment, as well as by using an incorrect set of k-tuples. The pSFs of group 3) are formed due to false positive and false negative k-tuples and can be: a) misconnected subfragments; b) subfragments with the wrong end; and c) false positive k-tuples which appear as false minimal subfragments.

[0158] Considering false positive k-tuples, there is the possibility for the presence of a k-tuple containing more than one wrong base or containing one wrong base somewhere in the middle, as well as the possibility for a k-tuple with a wrong base on the end. Generation of short, erroneous or misconnected subfragments is caused by the latter k-tuples. The k-tuples of the former two kinds represent wrong pSFs with length equal to k-tuple length.

[0159] In the case of one false negative k-tuple, pSFs are generated because of the impossibility of maximal overlapping. In the case of the presence of one false positive k-tuple with the wrong base on its leftmost or rightmost end, pSFs are generated because of the impossibility of unambiguous overlapping. When both false positive and false negative k-tuples with a common k−1 sequence are present in the file, pSFs are generated, and one of these pSFs contains the wrong k-tuple at the relevant end.

[0160] The process of correcting subfragments with errors in sequence and the linking of unambiguously connected pSFs is performed after subfragment generation and in the process of subfragment ordering. The first step, which consists of cutting the misconnected pSFs and obtaining the final subfragments by unambiguous connection of pSFs, is described below.

[0161] Misconnected subfragments can form in two ways. In the first, a mistake occurs when an erroneous k-tuple appears at the points of assembly of repeated sequences of length k−1. In the second, the repeated sequences are shorter than k−1. These situations can occur in two variants each. In the first variant, one of the repeated sequences represents the end of a fragment. In the second variant, the repeated sequence occurs at any position within the fragment. For the first possibility, the absence of some k-tuples from the file (false negatives) is required to generate a misconnection. The second possibility requires the presence of both false negative and false positive k-tuples in the file. Considering the repetitions of a k−1 sequence, the lack of only one k-tuple is sufficient when either end is repeated internally. The lack of two is needed for strictly internal repetition. The reason is that the end of a sequence can be considered informatically as an endless linear array of false negative k-tuples. From the “smaller than k−1” case, only the repeated sequence of length k−2, which requires two or three specific erroneous k-tuples, will be considered. It is very likely that these will be the only cases which will be detected in a real experiment, the others being much less frequent.

[0162] Recognition of the misconnected subfragments is more strictly defined when a repeated sequence does not appear at the end of the fragment. In this situation, one can detect two further subfragments, one of which contains on its leftmost end, and the other on its rightmost end, k−2 sequences which are also present in the misconnected subfragment. When the repeated sequence is on the end of the fragment, there is only one subfragment which contains, on its leftmost or rightmost end, the k−2 sequence causing the mistake in subfragment formation.

[0163] The removal of misconnected subfragments by cutting them is performed according to the common rule: if the leftmost or rightmost sequence of length k−2 of any subfragment is present in any other subfragment, the latter subfragment is to be cut into two subfragments, each of them containing the k−2 sequence. This rule does not cover rarer situations of a repeated end when there is more than one false negative k-tuple at the point of the repeated k−1 sequence. Misconnected subfragments of this kind can be recognized by using the information from the overlapped fragments, or informative fragments of both the basic and ordering libraries. In addition, the misconnected subfragment will remain when two or more false negative k-tuples occur on both positions which contain the identical k−1 sequence. This is a very rare situation since it requires at least 4 specific false k-tuples. An additional rule can be introduced to cut these subfragments on sequences of length k if the given sequence can be obtained by combination of sequences shorter than k−2 from the end of one subfragment and the start of another.

[0164] By strict application of the described rule, some completeness is lost to ensure the accuracy of the output. Some of the subfragments will be cut although they are not misconnected, since they fit into the pattern of a misconnected subfragment. There are several situations of this kind. For example, a fragment, besides at least two identical k−1 sequences, contains any k−2 sequence from k−1, or a fragment contains a k−2 sequence repeated at least twice and at least one false negative k-tuple containing the given k−2 sequence in the middle, etc.

[0165] The aim of this part of the algorithm is to reduce the number of pSFs to a minimal number of longer subfragments with correct sequence. The generation of unique longer subfragments or a complete sequence is possible in two situations. The first situation concerns the specific order of repeated k−1 words. There are cases in which some or all maximally extended pSFs (the first group of pSFs) can be uniquely ordered. For example, in fragment S-R1-a-R2-b-R1-c-R2-E, where S and E are the start and end of a fragment, a, b, and c are different sequences specific to respective subfragments, and R1 and R2 are two k−1 sequences that are tandemly repeated, five subfragments are generated (S-R1, R1-a-R2, R2-b-R1, R1-c-R2, and R2-E). They may be ordered in two ways: the original sequence above or S-R1-c-R2-b-R1-a-R2-E. In contrast, in a fragment with the same number and types of repeated sequences but ordered differently, i.e. S-R1-a-R1-b-R2-c-R2-E, there is no other sequence which includes all subfragments. Examples of this type can be recognized only after the process of generation of pSFs. They represent the necessity for two steps in the process of pSF generation. The second situation, the generation of false short subfragments on positions of nonrepeated k−1 sequences when the files contain false negative and/or positive k-tuples, is more important.
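The two repeat arrangements in this example can be checked by enumerating every chain that uses each subfragment exactly once. The sketch below is illustrative only: each subfragment is written as a tuple of labels, both repeats are labelled R1 and R2 throughout, and the function name `orderings` is hypothetical.

```python
def orderings(subfragments, start="S", end="E"):
    """Enumerate every chain from start to end that uses each
    subfragment exactly once, joining on matching repeat labels."""
    results = []

    def extend(chain, remaining):
        if not remaining:
            if chain[-1][-1] == end:
                results.append(chain)
            return
        tried = set()  # avoid counting identical subfragments twice
        for i, sf in enumerate(remaining):
            if sf[0] == chain[-1][-1] and sf not in tried:
                tried.add(sf)
                extend(chain + [sf], remaining[:i] + remaining[i + 1:])

    for i, sf in enumerate(subfragments):
        if sf[0] == start:
            extend([sf], subfragments[:i] + subfragments[i + 1:])
    return results


# Fragment S-R1-a-R2-b-R1-c-R2-E: repeats alternate.
alternating = [("S", "R1"), ("R1", "a", "R2"), ("R2", "b", "R1"),
               ("R1", "c", "R2"), ("R2", "E")]

# Fragment S-R1-a-R1-b-R2-c-R2-E: each repeat recurs before the other appears.
grouped = [("S", "R1"), ("R1", "a", "R1"), ("R1", "b", "R2"),
           ("R2", "c", "R2"), ("R2", "E")]

print(len(orderings(alternating)))  # 2
print(len(orderings(grouped)))      # 1
```

The first arrangement admits two orderings (a and c can be exchanged), while the second admits only the original one, matching the statement that no other sequence includes all subfragments.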

[0166] The solution for both pSF groups consists of two parts. First, the false positive k-tuples appearing as nonexistent minimal subfragments are eliminated. All k-tuple subfragments of length k which do not have an overlap on either end, of a length longer than k−a on one end and longer than k−b on the other end, are eliminated to enable formation of the maximal number of connections. In our experiments, the values for a and b of 2 and 3, respectively, appeared to be adequate to eliminate a sufficient number of false positive k-tuples.

[0167] The merging of subfragments that can be uniquely connected is accomplished in the second step. The rule for connection is: two subfragments may be unambiguously connected if, and only if, the overlapping sequence at the relevant end or start of two subfragments is not present at the start and/or end of any other subfragment.

[0168] The exception is if one subfragment from the considered pair has the identical beginning and end. In that case connection is permitted, even if there is another subfragment with the same end present in the file. The main problem here is the precise definition of overlapping sequence. The connection is not permitted if the overlapping sequence unique for only one pair of subfragments is shorter than k−2, or it is k−2 or longer but an additional subfragment exists with the overlapping sequence of any length longer than k−4. Also, both the canonical ends of pSFs and the ends after omitting one (or a few) last bases are considered as the overlapping sequences.
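The core of the connection rule can be sketched as follows. This is a deliberately simplified illustration: it tests a single fixed overlap length (the parameter `overlap_len`, an assumption), and it omits the exception for a subfragment with identical beginning and end, the k−2 and k−4 thresholds, and the shortened ends. The function name `unambiguous_pairs` is hypothetical.

```python
def unambiguous_pairs(subfragments, overlap_len):
    """Return (left, right) pairs whose shared overlap of overlap_len
    bases is absent from the start and end of every other subfragment."""
    pairs = []
    for i, left in enumerate(subfragments):
        tail = left[-overlap_len:]
        for j, right in enumerate(subfragments):
            if i == j or right[:overlap_len] != tail:
                continue
            # Any third subfragment starting or ending with the same
            # sequence makes the connection ambiguous.
            rivals = [s for n, s in enumerate(subfragments)
                      if n not in (i, j)
                      and (s.startswith(tail) or s.endswith(tail))]
            if not rivals:
                pairs.append((left, right))
    return pairs
```

For example, `unambiguous_pairs(["ACGTAC", "TACGGA", "TTTTTT"], 3)` connects the first two subfragments through TAC, whereas adding a fourth subfragment that also starts with TAC blocks the connection.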

[0169] After this step some false positive k-tuples (as minimal subfragments) and some subfragments with a wrong end may survive. In addition, on very rare occasions where a certain number of some specific false k-tuples are simultaneously present, an erroneous connection may take place. These cases will be detected and solved in the subfragment ordering process, and in the additional control steps along with the handling of uncut “misconnected” subfragments.

[0170] The short subfragments that are obtained are of two kinds. In the common case, these subfragments may be unambiguously connected among themselves because of the distribution of repeated k−1 sequences. This may be done after the process of generation of pSFs and is a good example of the necessity for two steps in the process of pSF generation. In the case of using the file containing false positive and/or false negative k-tuples, short pSFs are obtained on the sites of nonrepeated k−1 sequences. Considering false positive k-tuples, a k-tuple may contain more than one wrong base (or one wrong base somewhere in the middle), or a wrong base on the end. Generation of short and erroneous (or misconnected) subfragments is caused by the latter k-tuples. The k-tuples of the former kind represent wrong pSFs with length equal to k-tuple length.

[0171] The aim of the pSF-merging part of the algorithm is the reduction of the number of pSFs to the minimal number of longer subfragments with the correct sequence. All k-tuple subfragments that do not have an overlap on either end, of a length longer than k−a on one end and longer than k−b on the other end, are eliminated to enable the maximal number of connections. In this way, the majority of false positive k-tuples are discarded. The rule for connection is: two subfragments can be unambiguously connected if, and only if, the overlapping sequence of the relevant end or start of two subfragments is not present on the start and/or end of any other subfragment. The exception is a subfragment with the identical beginning and end. In that case connection is permitted, even if there is another subfragment with the same end present in the file. The main problem here is the precise definition of overlapping sequence. The presence of at least two specific false negative k-tuples on the points of repetition of k−1 or k−2 sequences, as well as the combining of false positive and false negative k-tuples, may destroy or “mask” some overlapping sequences and can produce an unambiguous, but wrong, connection of pSFs. To prevent this, completeness must be sacrificed on account of exactness: the connection is not permitted on end-sequences shorter than k−2, nor in the presence of an extra overlapping sequence longer than k−4. The overlapping sequences are defined from the end of the pSFs, or after omitting one or a few last bases.

[0172] In very rare situations, with the presence of a certain number of some specific false positive and false negative k-tuples, some subfragments with the wrong end can survive, some false positive k-tuples (as minimal subfragments) can remain, or an erroneous connection can take place. These cases are detected and solved in the subfragment ordering process, and in the additional control steps along with the handling of uncut, misconnected subfragments.

[0173] The process of ordering of subfragments is similar to the process of their generation. If one considers subfragments as longer k-tuples, ordering is performed by their unambiguous connection via overlapping ends. The informational basis for unambiguous connection is the division of subfragments generated in fragments of the basic library into groups representing segments of those fragments. The method is analogous to the biochemical solution of this problem based on hybridization with longer oligonucleotides with a relevant connecting sequence. The connecting sequences are generated as subfragments using the k-tuple sets of the appropriate segments of basic library fragments. Relevant segments are defined by the fragments of the ordering library that overlap with the respective fragments of the basic library. The shortest segments are informative fragments of the ordering library. The longer ones are several neighboring informative fragments or total overlapping portions of corresponding fragments of the ordering and basic libraries. In order to decrease the number of separate samples, fragments of the ordering library are randomly pooled, and the unique k-tuple content is determined.

[0174] By using the large number of fragments in the ordering library, very short segments are generated, thus reducing the chance of the multiple appearance of the k−1 sequences which are the reasons for generation of the subfragments. Furthermore, longer segments, consisting of the various regions of the given fragment of the basic library, do not contain some of the repeated k−1 sequences. In every segment a connecting sequence (a connecting subfragment) is generated for a certain pair of the subfragments from the given fragment. The process of ordering consists of three steps: (1) generation of the k-tuple contents of each segment; (2) generation of subfragments in each segment; and (3) connection of the subfragments of the segments. Primary segments are defined as significant intersections and differences of k-tuple contents of a given fragment of the basic library with the k-tuple contents of the pools of the ordering library. Secondary (shorter) segments are defined as intersections and differences of the k-tuple contents of the primary segments.

[0175] There is a problem of accumulating both false positive and negative k-tuples in both the differences and intersections. The false negative k-tuples from starting sequences accumulate in the intersections (overlapping parts), as well as false positive k-tuples occurring randomly in both sequences, but not in the relevant overlapping region. On the other hand, the majority of false positives from either of the starting sequences is not taken up into intersections. This is an example of the reduction of experimental errors from individual fragments by using information from fragments overlapping with them. The false k-tuples accumulate in the differences for another reason. The set of false negatives from the original sequences is enlarged by false positives from intersections, and the set of false positives by those k-tuples which are not included in the intersection by error, i.e. are false negative in the intersection. If the starting sequences contain 10% false negative data, the primary and secondary intersections will contain 19% and 28% false negative k-tuples, respectively. On the other hand, a mathematical expectation of 77 false positives may be predicted if the basic fragment and the pools have lengths of 500 bp and 10,000 bp, respectively. However, there is a possibility of recovering most of the “lost” k-tuples and of eliminating most of the false positive k-tuples.

[0176] First, one has to determine a basic content of the k-tuples for a given segment as the intersection of a given pair of the k-tuple contents. This is followed by including all k-tuples of the starting k-tuple contents in the intersection, which contain at one end k−1 and at the other end k−+ sequences which occur at the ends of two k-tuples of the basic set. This is done before generation of the differences, thus preventing the accumulation of false positives in that process. Following that, the same type of enlargement of the k-tuple set is applied to differences, with the distinction that the borrowing is from the intersections. All borrowed k-tuples are eliminated from the intersection files as false positives.

[0177] The intersection, i.e. a set of common k-tuples, is defined for each pair (a basic fragment)×(a pool of ordering library). If the number of k-tuples in the set is significant, it is enlarged with the false negatives according to the described rule. The primary difference set is obtained by subtracting the obtained intersection set from a given basic fragment. The false negative k-tuples are appended to the difference set by borrowing from the intersection set according to the described rule and, at the same time, removed from the intersection set as false positive k-tuples. When the basic fragment is longer than the pooled fragments, this difference can represent two separate segments, which somewhat reduces its utility in further steps. The primary segments are all generated intersections and differences of pairs (a basic fragment)×(a pool of ordering library) containing a significant number of k-tuples. K-tuple sets of secondary segments are obtained by comparison of k-tuple sets of all possible pairs of primary segments. The two differences are defined from each pair which produces an intersection with a significant number of k-tuples. The majority of available information from overlapped fragments is recovered in this step, so that there is little to be gained from a third round of forming intersections and differences.
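The basic set operations behind primary segments can be sketched as follows. This is a minimal illustration that assumes a single k-tuple length and a simple count threshold for "significant" (the parameter `min_size` is an assumption, since the text does not give a value); the enlargement and borrowing of false negative k-tuples is not reproduced. The function names are hypothetical.

```python
def ktuple_content(sequence, k):
    """Set of all overlapping k-tuples contained in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def primary_segments(basic_ktuples, pool_ktuples, min_size):
    """Return (intersection, difference) for one pair
    (basic fragment) x (pool of ordering library), or None when the
    intersection is too small to be considered significant."""
    common = basic_ktuples & pool_ktuples
    if len(common) < min_size:
        return None
    return common, basic_ktuples - common


basic = ktuple_content("ACGTACGGTA", 4)
pool = ktuple_content("TACGGTACC", 4)
print(primary_segments(basic, pool, min_size=3))
```

Secondary segments would be obtained by applying the same function to pairs of primary segments.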

[0178] (2) Generation of the subfragments of the segments is performed identically as described for the fragments of the basic library.

[0179] (3) The method of connection of subfragments consists of sequentially determining the correctly linked pairs of subfragments among the subfragments from a given basic library fragment which have some overlapped ends. In the case of 4 relevant subfragments, two of which contain the same beginning and two having the same end, there are 4 different pairs of subfragments that can be connected. In general 2 are correct and 2 are wrong. To find the correct ones, the presence of the connecting sequences of each pair is tested in the subfragments generated from all primary and secondary segments for a given basic fragment. The length and the position of the connecting sequence are chosen to avoid interference with sequences which occur by chance. They are k+2 or longer, and include at least one element 2 beside overlapping sequence in both subfragments of a given pair. The connection is permitted only if the two connecting sequences are found and the remaining two do not exist. The two linked subfragments replace the former subfragments in the file and the process is cyclically repeated.

[0180] Repeated sequences are generated in this step. This means that some subfragments are included in linked subfragments more than once. They will be recognized by finding the relevant connecting sequence which engages one subfragment in connection with two different subfragments.

[0181] The recognition of misconnected subfragments generated in the processes of building pSFs and merging pSFs into longer subfragments is based on testing whether the sequences of subfragments from a given basic fragment exist in the sequences of subfragments generated in the segments for the fragment. The sequences from an incorrectly connected position will not be found, indicating the misconnected subfragments.

[0182] Besides the described three steps in ordering of subfragments, some additional control steps or steps applicable to specific sequences will be necessary for the generation of a more complete sequence without mistakes.

[0183] The determination of which subfragment belongs to which segment is performed by comparison of contents of k-tuples in segments and subfragments. Because of the errors in the k-tuple contents (due to the primary error in pools and statistical errors due to the frequency of occurrences of k-tuples), the exact partitioning of subfragments is impossible. Thus, instead of an “all or none” partition, the chance of coming from the given segment (P(sf,s)) is determined for each subfragment. This probability is a function of the lengths of k-tuples, the lengths of subfragments, the lengths of fragments of the ordering library, the size of the pool, and the percentage of false k-tuples in the file:

P(sf,s)=(Ck−F)/Lsf,

[0184] where Lsf is the length of subfragment, Ck is the number of common k-tuples for a given subfragment/segment pair, and F is the parameter that includes relations between lengths of k-tuples, fragments of basic library, the size of the pool, and the error percentage.
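The expression P(sf,s)=(Ck−F)/Lsf can be evaluated with a short sketch. The text does not give a formula for F, so it is passed in as a plain number here; counting Ck over distinct k-tuples of a single length k is also an assumption. The function names are hypothetical.

```python
def count_common_ktuples(subfragment, segment_ktuples, k):
    """Ck: number of distinct k-tuples of the subfragment that also
    occur in the k-tuple content of the segment."""
    own = {subfragment[i:i + k] for i in range(len(subfragment) - k + 1)}
    return len(own & set(segment_ktuples))


def membership_score(subfragment, segment_ktuples, k, f_param):
    """P(sf,s) = (Ck - F) / Lsf, with F supplied by the caller."""
    ck = count_common_ktuples(subfragment, segment_ktuples, k)
    return (ck - f_param) / len(subfragment)


segment = {"TACG", "ACGG", "CGGT", "GGTA", "GTAC"}
score = membership_score("ACGTACGGTA", segment, k=4, f_param=1)
print(score)  # 0.4
```

A subfragment would then be attributed to a segment when its score exceeds a chosen criterion, which can be relaxed in later rounds as the following paragraph describes.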

[0185] Subfragments attributed to a particular segment are treated as redundant short pSFs and are submitted to a process of unambiguous connection. The definition of unambiguous connection is slightly different in this case, since it is based on a probability that subfragments with overlapping end(s) belong to the segment considered. Besides, the accuracy of unambiguous connection is controlled by following the connection of these subfragments in other segments. After the connection in different segments, all of the obtained subfragments are merged together, shorter subfragments included within longer ones are eliminated, and the remaining ones are submitted to the ordinary connecting process. If the sequence is not regenerated completely, the process of partition and connection of subfragments is repeated with the same or less severe criteria of probability of belonging to the particular segment, followed by unambiguous connection.

[0186] Using severe criteria for defining unambiguous overlap, some information is not used. Instead of a complete sequence, several subfragments that define a number of possibilities for a given fragment are obtained. Using less severe criteria, an accurate and complete sequence is generated. In a certain number of situations, e.g. an erroneous connection, it is possible to generate a complete, but incorrect, sequence, or to generate “monster” subfragments with no connection among them. Thus, for each fragment of the basic library one obtains: a) several possible solutions where one is correct and b) the most probable correct solution. Also, in a very small number of cases, due to the mistake in the subfragment generation process or due to the specific ratio of the probabilities of belonging, no unambiguous solution is generated or one, the most probable solution. These cases remain as incomplete sequences, or the unambiguous solution is obtained by comparing these data with other, overlapped fragments of the basic library.

[0187] The described algorithm was tested on a randomly generated, 50 kb sequence, containing 40% GC to simulate the GC content of the human genome. In the middle part of this sequence were inserted various Alu, and some other repetitive sequences, of a total length of about 4 kb. To simulate an in vitro SBH experiment, the following operations were performed to prepare appropriate data.

[0188] Positions of sixty 5 kb overlapping “clones” were randomly defined, to simulate preparation of a basic library.

[0189] Positions of one thousand 500 bp “clones” were randomly determined to simulate making the ordering library. These fragments were extracted from the sequence. Random pools of 20 fragments were made, and k-tuple sets of pools were determined and stored on the hard disk. These data are used in the subfragment ordering phase. For the same density of clones, 4 million clones in the basic library and 3 million clones in the ordering library are used for the entire human genome. The total number of 7 million clones is several fold smaller than the number of clones a few kb long for random cloning of almost all of genomic DNA and sequencing by a gel-based method.

[0190] From the data on the starts and ends of 5 kb fragments, 117 “informative fragments” were determined to be in the sequence. This was followed by determination of the sets of overlapping k-tuples of which each single “informative fragment” consists. Only the subset of k-tuples matching a predetermined list was used. The list contained 65% 8-mers, 30% 9-mers, and 5% 10-12-mers. Processes of generation and ordering of subfragments were performed on these data.

[0191] The testing of the algorithm was performed on the simulated data in two experiments. The sequence of 50 informative fragments was regenerated with the 100% correct data set (over 20,000 bp), and 26 informative fragments (about 10,000 bp) with 10% false k-tuples (5% positive and 5% negative ones).

[0192] In the first experiment, all subfragments were correct and in only one out of 50 informative fragments the sequence was not completely regenerated but remained in the form of 5 subfragments. The analysis of positions of overlapped fragments of the ordering library has shown that they lack the information for the unique ordering of the 5 subfragments. The subfragments may be connected in two ways based on overlapping ends, 1-2-3-4-5 and 1-4-3-2-5. The only difference is the exchange of positions of subfragments 2 and 4. Since subfragments 2, 3, and 4 are relatively short (total of about 100 bp), a relatively greater chance existed, and occurred in this case, that none of the fragments of the ordering library started or ended in the subfragment 3 region.

[0193] To simulate real sequencing, some false (“hybridization”) data was included as input in a number of experiments. In oligomer hybridization experiments, under proposed conditions, the only situation producing unreliable data is the end mismatch versus full match hybridization. Therefore, in simulation only those k-tuples differing in a single element on either end from the real one were considered to be false positives. These “false” sets are made as follows. To the original set of k-tuples of the informative fragment, a subset of 5% false positive k-tuples is added. False positive k-tuples are made by randomly picking a k-tuple from the set, copying it and altering a nucleotide on its beginning or end. This is followed by subtraction of a subset of 5% randomly chosen k-tuples. In this way the statistically expected number of the most complicated cases is generated, in which the correct k-tuple is replaced with a k-tuple with the wrong base on the end.
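The construction of such a "false" set can be sketched as follows. This is an illustration only, not the code used for the reported tests: the function name `make_false_set`, the seeded random generator, and the choice to draw the removed k-tuples from the original (correct) set are all assumptions.

```python
import random


def make_false_set(ktuples, fraction=0.05, seed=0):
    """Add end-mismatch false positives and remove an equal share of
    correct k-tuples. Returns (noisy_set, added, removed)."""
    rng = random.Random(seed)
    original = sorted(set(ktuples))
    n = round(len(original) * fraction)

    noisy = set(original)
    added = set()
    while len(added) < n:
        t = rng.choice(original)               # pick and copy a k-tuple
        pos = rng.choice((0, len(t) - 1))      # alter first or last base
        base = rng.choice([b for b in "ACGT" if b != t[pos]])
        mutant = base + t[1:] if pos == 0 else t[:-1] + base
        if mutant not in noisy:                # keep only new k-tuples
            noisy.add(mutant)
            added.add(mutant)

    removed = set(rng.sample(original, n))     # simulated false negatives
    return noisy - removed, added, removed
```

Because the copied and removed k-tuples are chosen independently, only some of the removed k-tuples coincide with the template of an added one, which is consistent with the statement that the overall share of false data varies from case to case.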

[0194] Production of k-tuple sets as described leads to up to 10% of false data. This value varies from case to case, due to the randomness of choice of k-tuples to be copied, altered, and erased. Nevertheless, this percentage exceeds by 3-4 times the amount of unreliable data in real hybridization experiments. The introduced error of 10% leads to a twofold increase in the number of subfragments both in fragments of the basic library (basic library informative fragments) and in segments. About 10% of the final subfragments have a wrong base at the end, as expected for a k-tuple set which contains false positives (see generation of primary subfragments). Neither cases of misconnection of subfragments nor subfragments with the wrong sequence were observed. In 4 informative fragments out of 26 examined in the ordering process, the complete sequence was not regenerated. In all 4 cases the sequence was obtained in the form of several longer subfragments and several shorter subfragments contained in the same segment. This result shows that the algorithmic principles allow working with a large percentage of false data.

[0195] The success of the generation of the sequence from its k-tuple content may be described in terms of completeness and accuracy. In the process of generation, two particular situations can be defined: 1) some part of the information is missing in the generated sequence, but one knows where the ambiguities are and to which type they belong, and 2) the regenerated sequence that is obtained does not match the sequence from which the k-tuple content is generated, but the mistake cannot be detected. Assuming the algorithm is developed to its theoretical limits, as in the use of the exact k-tuple sets, only the first situation can take place. There the incompleteness results in a certain number of subfragments that may not be ordered unambiguously and the problem of determination of the exact length of monotonous sequences, i.e. the number of perfect tandem repeats.

[0196] With false k-tuples, incorrect sequence may be generated. The reason for mistakes does not lie in the shortcomings of the algorithm, but in the fact that a given content of k-tuples unambiguously represents a sequence that differs from the original one. One may define three classes of error, depending on the kind of the false k-tuples present in the file. False negative k-tuples (which are not accompanied by false positives) produce “deletions”. False positive k-tuples produce “elongations (unequal crossing over)”. False positives accompanied by false negatives are the reason for generation of “insertions”, alone or combined with “deletions”. The deletions are produced when all of the k-tuples (or their majority) between two possible starts of the subfragments are false negatives. Since every position in the sequence is defined by k k-tuples, the occurrence of deletions in a common case requires k consecutive false negatives. (With 10% false negatives and k=8, this situation takes place after every 10^8 elements.) This situation is extremely infrequent even in mammalian genome sequencing using random libraries containing ten genome equivalents.
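The frequency quoted for deletions follows from a simple estimate, shown here as a worked equation. It assumes that false negatives occur independently at each of the k consecutive k-tuples, with f denoting the false negative rate (the symbol f is introduced here for illustration):

```latex
P(\text{$k$ consecutive false negatives}) \approx f^{k} = (0.1)^{8} = 10^{-8}
```

Under this assumption such a run is expected about once per 10^8 positions.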

[0197] Elongation of the end of the sequence caused by false positive k-tuples is the special case of “insertions”, since the end of the sequence can be considered as an endless linear array of false negative k-tuples. One may consider a group of false positive k-tuples producing subfragments longer than one k-tuple. Situations of this kind may be detected if subfragments are generated in overlapped fragments, like random physical fragments of the ordering library. An insertion, or insertion in place of a deletion, can arise as a result of specific combinations of false positive and false negative k-tuples. In the first case, the number of consecutive false negatives is smaller than k. Both cases require several overlapping false positive k-tuples. The insertions and deletions are mostly theoretical possibilities without sizable practical repercussions, since the requirements in the number and specificity of false k-tuples are simply too high.

[0198] In every other situation of not meeting the theoretical requirement of the minimal number and the kind of the false positives and/or negatives, mistakes in the k-tuple content may produce only a lesser completeness of the generated sequence.

1. A method for analyzing nucleic acids by hybridization, comprising the steps of: arraying a first plurality of nucleic acid segments on a first sector of a substrate; disposing a second plurality of nucleic acid segments on a second sector of said substrate; exposing, under conditions discriminating between full complementarity and a one base mismatch, said first plurality of nucleic acid segments to a first hybridization probe in said first sector, said first hybridization probe being shorter than one from among said first plurality of nucleic acid segments, to said plurality of nucleic acid segments; incubating under conditions discriminating between full complementarity and a one base mismatch, a second hybridization probe in said second sector, said second hybridization probe being shorter than a segment from among said second plurality of nucleic acid segments and said second hybridization probe being different in sequence from said first hybridization probe; detecting hybridization of a hybridization probe to a nucleic acid segment; and analyzing the result.
 2. The method as recited in claim 1,further comprising, prior to said disposing step, the step ofintroducing a barrier to movement of a nucleic acid.
 3. The method asrecited in claim 1 further comprising, after said arraying and saiddisposing step but before said incubating step, the step of introducinga barrier to movement of a nucleic acid.
 4. The method as recited inclaim 3 wherein said introducing step comprises pressing a physicalbarrier against said substrate.
 5. The method as recited in claim 2wherein said introducing step comprises the step of applying adirection-switching electrical field perpendicular to said support toprevent the mixing of probes between sectors.
 6. The method as recitedin claim 3 wherein said introducing step comprises the step of applyinga direction-switching electrical field perpendicular to said support toprevent the mixing of probes between sectors.
 7. The method as recitedin claim 1 wherein said arraying step comprises the step of spottingnucleic acid samples by means of a pin array.
 8. The method as recitedin claim 1 wherein said arraying step comprises the step of dispensingnucleic acid samples by an array of tubes.
 9. The method as recited inclaim 1 wherein said arraying step comprises the step of jet printingnucleic acid samples.
 10. The method as recited in claim 1 wherein saidexposing step comprises the step of applying a plurality of contiguouslyhybridizing probes.
 11. The method as recited in claim 1 wherein saidincubating step comprises the step of applying a plurality ofcontiguously hybridizing probes.
 12. The method as recited in claim 10further comprising the step of ligating at least two of said pluralityof contiguously hybridizing probes.
 13. The method as recited in claim11 further comprising the step of ligating at least two of saidplurality of contiguously hybridizing probes.
 14. The method as recitedin claim 1 wherein said exposing step comprises the step of applying aplurality of competitively hybridizing probes having overlapping nucleicacid sequences.
 15. The method as recited in claim 1 wherein saidincubating step comprises the step of applying a plurality ofcompetitively hybridizing probes having overlapping nucleic acidsequences.
 16. The method as recited in claim 1 wherein at least two of said first plurality of nucleic acid segments are arrayed as a mixture.
 17. The method as recited in claim 1 wherein at least two of said second plurality of nucleic acid segments are disposed as a mixture.
 18. The method as recited in claim 1 further comprising the steps of preparing samples by digestion with an Hga I type restriction enzyme and ligating the resulting restriction fragments with an anchor.
 19. The method as recited in claim 1 further comprising the step of selecting probes from a universal set of probes of a given length.
 20. The method as recited in claim 1 further comprising the step of selecting probes from an incomplete set of probes of a given length.
 21. The method as recited in claim 1 further comprising the step of selecting deoxyribonucleotide probes.
 22. The method as recited in claim 1 further comprising the step of selecting ribonucleotide probes.
 23. The method as recited in claim 1 further comprising the step of selecting a nucleic acid analog selected from the group consisting of protein nucleic acid probes and probes containing base analogs.
 24. The method as recited in claim 1 further comprising the step of multiplex labelling of probes.
 25. The method as recited in claim 1 further comprising the step of degrading a label on an unhybridized probe.
 26. The method as recited in claim 19 wherein said exposing or said incubating step comprises the step of assembling a set of universal probes 6, 7, 8, 9 or 10 bases in length.
 27. The method as recited in claim 19 wherein said exposing or said incubating step comprises the step of assembling a set of universal probes 6, 7, 8, 9 or 10 bases in length.
 28. The method as recited in claim 20 wherein said exposing or said incubating step comprises the step of assembling an incomplete set of probes 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30 bases in length.
 29. Apparatus analyzing nucleic acids by hybridization comprising a substrate having points of attachment for nucleic acid fragments, said substrate being segmented by hydrophobic regions.
 30. The method as recited in claim 20 wherein said disposing step comprises the step of assembling an incomplete set of probes 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30 bases in length.
 31. The method of claim 1 further comprising the step of confirming the relative order of at least two bases in a segment by detecting hybridization of two or more probes having overlapping nucleic acid sequences including said at least two bases.
 32. A method for nucleotide sequence analysis comprising the steps of: introducing a sample to an array of probes; adjusting the temperature to be one at which a majority of sample molecules are unassociated with ligated probes at any given time; adding a labelled probe to the mixture; incubating the mixture with ligase; removing free probes; and detecting ligation products.
 33. The method as recited in claim 1 further comprising the steps of defining additional probes for improving a desired result and repeating said exposing, incubating, detecting and analyzing steps.
 34. The method as recited in claim 1 further comprising the step of stripping the substrate of probes for reuse of said pluralities of nucleic acid segments.
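
The universal probe sets of claims 19 and 26 and the overlap confirmation of claim 31 may be illustrated by a minimal Python sketch. It assumes idealized, error-free hybridization and writes each probe as the target-strand sequence it matches, which is a simplification. The function names are hypothetical and do not appear in the source.

```python
from itertools import product

BASES = "ACGT"


def universal_probe_set(k):
    """Enumerate every probe of length k, i.e. all 4**k sequences over A, C, G, T."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def positive_probes(segment, k):
    """Idealized scoring (assumption): a probe is positive iff it occurs in the segment."""
    return {segment[i:i + k] for i in range(len(segment) - k + 1)}


def order_confirmed(segment, k, pos):
    """Return True if the adjacent bases at pos and pos + 1 are both contained
    in two or more distinct overlapping probes of length k."""
    covering = {
        segment[i:i + k]
        for i in range(len(segment) - k + 1)
        if i <= pos and pos + 1 < i + k
    }
    return len(covering) >= 2
```

Under these assumptions, a universal set of 6-mers holds 4**6 = 4096 probes, and a pair of interior bases is confirmed when at least two distinct overlapping probes span both of them.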